Termination Criteria for Solving 
Concurrent Safety and Reachability Games 



Krishnendu Chatterjee^1, Luca de Alfaro^1, and Thomas A. Henzinger^{2,3}

^1 CE, University of California, Santa Cruz, USA
^2 EECS, University of California, Berkeley, USA
^3 EPFL, Switzerland

{c.krish, tah}@eecs.berkeley.edu, luca@soe.ucsc.edu



Abstract. We consider concurrent games played on graphs. At every round of a game, each
player simultaneously and independently selects a move; the moves jointly determine the
transition to a successor state. Two basic objectives are the safety objective to stay forever in a given
set of states, and its dual, the reachability objective to reach a given set of states. We present
in this paper a strategy improvement algorithm for computing the value of a concurrent safety
game, that is, the maximal probability with which player 1 can enforce the safety objective. The
algorithm yields a sequence of player-1 strategies which ensure probabilities of winning that
converge monotonically to the value of the safety game.

Our result is significant because the strategy improvement algorithm provides, for the first time,
a way to approximate the value of a concurrent safety game from below. Since a value
iteration algorithm, or a strategy improvement algorithm for reachability games, can be used to
approximate the same value from above, the combination of both algorithms yields a method
for computing a converging sequence of upper and lower bounds for the values of concurrent
reachability and safety games. Previous methods could approximate the values of these games
only from one direction, and as no rates of convergence are known, they did not provide a
practical way to solve these games.



1 Introduction 

We consider games played between two players on graphs. At every round of the game, each of the 
two players selects a move; the moves of the players then determine the transition to the successor 
state. A play of the game gives rise to a path in the graph. We consider the two basic objectives for 
the players: reachability and safety. The reachability goal asks player 1 to reach a given set of target 
states or, if randomization is needed to play the game, to maximize the probability of reaching the 
target set. The safety goal asks player 2 to ensure that a given set of safe states is never left or, if 
randomization is required, to minimize the probability of leaving the safe set. The two objectives
are dual, and the games are determined: the maximal probability with which player 1 can reach the 
target set is equal to one minus the maximal probability with which player 2 can confine the game 
to the complement of the target set [17]. 

These games on graphs can be divided into two classes: turn-based and concurrent. In turn-based
games, only one player has a choice of moves at each state; in concurrent games, at each state both
players choose a move, simultaneously and independently, from a set of available moves. For
turn-based games, the solution of games with reachability and safety objectives has long been known. If
each move determines a unique successor state, then the games are P-complete and can be solved
in linear time in the size of the game graph. If, more generally, each move determines a probability
distribution on possible successor states, then the problem of deciding whether a turn-based game
can be won with probability greater than a given threshold $p \in [0,1]$ is in NP $\cap$ co-NP [4], and
the exact value of the game can be computed by a strategy improvement algorithm [5], which works
well in practice. These results all depend on the fact that in turn-based reachability and safety games,
both players have optimal deterministic (i.e., no randomization is required), memoryless strategies.
These strategies are functions from states to moves, so they are finite in number, and this guarantees
the termination of the strategy improvement algorithm.

The situation is very different for concurrent games, where randomization is required even in
the special case in which the transition function is deterministic. The player-1 value of the game is
defined, as usual, as the sup-inf value: the supremum, over all strategies of player 1, of the infimum,
over all strategies of player 2, of the probability of achieving the reachability or safety goal. In
concurrent reachability games, player 1 is guaranteed only the existence of $\varepsilon$-optimal strategies, which
ensure that the value of the game is achieved within a specified tolerance $\varepsilon > 0$ [16]. Moreover, while
these strategies (which depend on $\varepsilon$) are memoryless, in general they require randomization [9]. For
player 2 (the safety player), optimal memoryless strategies exist [10], which again require
randomization. All of these strategies are functions from states to probability distributions on moves. The
question of deciding whether a concurrent game can be won with probability greater than $p$ is in
PSPACE; this is shown by reduction to the theory of the real-closed fields [12], but no practical
algorithms were known.

To summarize: while practical strategy improvement algorithms are available for turn-based
reachability and safety games, so far no practical algorithms or even approximation schemes were
known for concurrent games. If one wanted to compute the value of a concurrent game within a
specified tolerance $\varepsilon > 0$, one was reduced to using a binary search algorithm that approximates
the value by iterating queries in the theory of the real-closed fields. Strategy improvement and value
iteration schemes were known for such games, but they could be used to approximate the value from
one direction only, for reachability goals from below, and for safety goals from above [10, 2]. Neither
scheme is guaranteed to terminate. Worse, since no convergence rates are known for these schemes,
they provide no termination criteria for approximating a value within $\varepsilon$.

In this paper, we present for the first time a strategy improvement scheme that approximates 
the value of a concurrent safety game from below. Strategy improvement algorithms are generally 
practical, and together with the known strategy improvement scheme, or the value iteration scheme, 
to approximate the value of such a game from above, we obtain a termination criterion for computing 
the value of concurrent reachability and safety games within any given tolerance $\varepsilon > 0$. This is the
first termination criterion for an algorithm that approximates the value of a concurrent game. 
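
The termination criterion described above can be illustrated by a minimal sketch. It assumes two hypothetical iterables, `lower_bounds` and `upper_bounds`, that stand in for the values produced by a from-below scheme and a from-above scheme at one fixed state; neither name, nor the toy sequences used here, comes from the paper.

```python
from typing import Iterable, Tuple


def approximate_value(lower_bounds: Iterable[float],
                      upper_bounds: Iterable[float],
                      eps: float) -> Tuple[float, float, int]:
    """Advance both bound sequences in lockstep; stop once they are eps-close."""
    for step, (lo, hi) in enumerate(zip(lower_bounds, upper_bounds), start=1):
        if hi - lo <= eps:
            # The true value lies in [lo, hi], so either bound is eps-accurate.
            return lo, hi, step
    raise RuntimeError("bound sequences exhausted before reaching the tolerance")


# Toy stand-ins for the two schemes (not derived from any game): both
# sequences converge to 0.5, one from below and one from above.
lower = (0.5 - 0.5 ** k for k in range(1, 60))
upper = (0.5 + 0.5 ** k for k in range(1, 60))
lo, hi, steps = approximate_value(lower, upper, eps=0.01)
```

The point of the design is that no convergence rate is needed: the gap between the two bounds is observable, so the loop can stop as soon as the gap falls below the tolerance.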

Several difficulties had to be overcome in developing our scheme. First, while the strategy
improvement algorithm that approximates reachability values from below [2] is based on locally
improving a strategy on the basis of the valuation it yields, this approach does not suffice for
approximating safety values from below: we would obtain an increasing sequence of values, but they would
not necessarily converge to the value of the game (see Example 1). Rather, we introduce a novel,
non-local improvement step, which augments the standard valuation-based improvement step. Each
non-local step involves the solution of an appropriately constructed turn-based game. Second, as
value iteration for safety objectives converges from above, while our sequences of strategies yield
values that converge from below, the proof of convergence for our algorithm cannot be derived from
a connection with value iteration, as was the case for reachability objectives. We had to develop new
proof techniques both to show the monotonicity of the strategy values produced by our algorithm,
and to show their convergence to the value of the game.

We also present a detailed analysis of termination criteria for turn-based stochastic games. Our
analysis is based on the strategy improvement algorithm for reachability games, and on a bound on the
precision of values for turn-based stochastic games. As a consequence of our analysis, we obtain an
improved upper bound for termination for turn-based stochastic games.

2 Definitions 

Notation. For a countable set $A$, a probability distribution on $A$ is a function $\delta: A \to [0,1]$ such that
$\sum_{a \in A} \delta(a) = 1$. We denote the set of probability distributions on $A$ by $\mathcal{D}(A)$. Given a distribution
$\delta \in \mathcal{D}(A)$, we denote by $\mathrm{Supp}(\delta) = \{x \in A \mid \delta(x) > 0\}$ the support set of $\delta$.

Definition 1 (Concurrent games). A (two-player) concurrent game structure
$G = \langle S, M, \Gamma_1, \Gamma_2, \delta \rangle$ consists of the following components:

- A finite state space $S$ and a finite set $M$ of moves or actions.

- Two move assignments $\Gamma_1, \Gamma_2: S \to 2^M \setminus \emptyset$. For $i \in \{1,2\}$, assignment $\Gamma_i$ associates with each
state $s \in S$ a nonempty set $\Gamma_i(s) \subseteq M$ of moves available to player $i$ at state $s$.

- A probabilistic transition function $\delta: S \times M \times M \to \mathcal{D}(S)$ that gives the probability
$\delta(s, a_1, a_2)(t)$ of a transition from $s$ to $t$ when player 1 chooses at state $s$ move $a_1$ and player 2
chooses move $a_2$, for all $s, t \in S$ and $a_1 \in \Gamma_1(s)$, $a_2 \in \Gamma_2(s)$.

We denote by $|\delta|$ the size of the transition function, i.e.,
$|\delta| = \sum_{s \in S} \sum_{a \in \Gamma_1(s)} \sum_{b \in \Gamma_2(s)} \sum_{t \in S} |\delta(s,a,b)(t)|$,
where $|\delta(s,a,b)(t)|$ is the number of bits required to specify the transition probability $\delta(s,a,b)(t)$.
We denote by $|G|$ the size of the game graph, and $|G| = |\delta| + |S|$. At every state $s \in S$, player 1
chooses a move $a_1 \in \Gamma_1(s)$, and simultaneously and independently player 2 chooses a move
$a_2 \in \Gamma_2(s)$. The game then proceeds to the successor state $t$ with probability $\delta(s, a_1, a_2)(t)$, for all $t \in S$.
A state $s$ is an absorbing state if for all $a_1 \in \Gamma_1(s)$ and $a_2 \in \Gamma_2(s)$, we have $\delta(s, a_1, a_2)(s) = 1$. In
other words, at an absorbing state $s$ for all choices of moves of the two players, the successor state
is always $s$.
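
As an illustration of Definition 1, the following sketch encodes a small concurrent game structure with plain dictionaries and checks well-formedness and the absorbing-state condition. The encoding (`gamma1`, `gamma2`, and `delta` keyed by `(s, a1, a2)`) and the two-state example are assumptions made for this sketch, not part of the paper.

```python
from fractions import Fraction

# Hypothetical encoding: delta[(s, a1, a2)] maps each successor t to
# the probability delta(s, a1, a2)(t); zero-probability entries are omitted.
S = {"s0", "s1"}
gamma1 = {"s0": {"a", "b"}, "s1": {"a"}}   # moves of player 1
gamma2 = {"s0": {"c", "d"}, "s1": {"c"}}   # moves of player 2
delta = {
    ("s0", "a", "c"): {"s0": Fraction(1)},
    ("s0", "a", "d"): {"s1": Fraction(1)},
    ("s0", "b", "c"): {"s1": Fraction(1)},
    ("s0", "b", "d"): {"s0": Fraction(1, 2), "s1": Fraction(1, 2)},
    ("s1", "a", "c"): {"s1": Fraction(1)},
}


def is_well_formed(S, gamma1, gamma2, delta):
    """Move sets are nonempty and every move pair yields a distribution on S."""
    for s in S:
        if not gamma1[s] or not gamma2[s]:
            return False
        for a1 in gamma1[s]:
            for a2 in gamma2[s]:
                dist = delta.get((s, a1, a2))
                if dist is None or not set(dist) <= S:
                    return False
                if sum(dist.values()) != 1:
                    return False
    return True


def is_absorbing(s, gamma1, gamma2, delta):
    """s is absorbing if every pair of available moves leads back to s."""
    return all(delta[(s, a1, a2)].get(s, 0) == 1
               for a1 in gamma1[s] for a2 in gamma2[s])


well_formed = is_well_formed(S, gamma1, gamma2, delta)
```

Exact rationals (`Fraction`) are used so that the check that probabilities sum to 1 is not affected by floating-point rounding.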

Definition 2 (Turn-based stochastic games). A turn-based stochastic game graph ($2\frac{1}{2}$-player
game graph) $G = ((S, E), (S_1, S_2, S_R), \delta)$ consists of a finite directed graph $(S, E)$, a partition
$(S_1, S_2, S_R)$ of the finite set $S$ of states, and a probabilistic transition function $\delta: S_R \to \mathcal{D}(S)$,
where $\mathcal{D}(S)$ denotes the set of probability distributions over the state space $S$. The states in $S_1$
are the player-1 states, where player 1 decides the successor state; the states in $S_2$ are the player-2
states, where player 2 decides the successor state; and the states in $S_R$ are the random or probabilistic
states, where the successor state is chosen according to the probabilistic transition function $\delta$.
We assume that for $s \in S_R$ and $t \in S$, we have $(s, t) \in E$ iff $\delta(s)(t) > 0$, and we often write $\delta(s, t)$
for $\delta(s)(t)$. For technical convenience we assume that every state in the graph $(S, E)$ has at least
one outgoing edge. For a state $s \in S$, we write $E(s)$ to denote the set $\{t \in S \mid (s, t) \in E\}$ of possible
successors. We denote by $|\delta|$ the size of the transition function, i.e.,
$|\delta| = \sum_{s \in S_R} \sum_{t \in S} |\delta(s)(t)|$,
where $|\delta(s)(t)|$ is the number of bits required to specify the transition probability $\delta(s)(t)$. We denote
by $|G|$ the size of the game graph, and $|G| = |\delta| + |S| + |E|$.

Plays. A play $\omega$ of $G$ is an infinite sequence $\omega = \langle s_0, s_1, s_2, \ldots \rangle$ of states in $S$ such that for all
$k \ge 0$, there are moves $a_1^k \in \Gamma_1(s_k)$ and $a_2^k \in \Gamma_2(s_k)$ with $\delta(s_k, a_1^k, a_2^k)(s_{k+1}) > 0$. We denote by
$\Omega$ the set of all plays, and by $\Omega_s$ the set of all plays $\omega = \langle s_0, s_1, s_2, \ldots \rangle$ such that $s_0 = s$, that is,
the set of plays starting from state $s$.

Selectors and strategies. A selector $\xi$ for player $i \in \{1, 2\}$ is a function $\xi: S \to \mathcal{D}(M)$ such that
for all states $s \in S$ and moves $a \in M$, if $\xi(s)(a) > 0$, then $a \in \Gamma_i(s)$. A selector $\xi$ for player
$i$ at a state $s$ is a distribution over moves such that if $\xi(s)(a) > 0$, then $a \in \Gamma_i(s)$. We denote by
$\Lambda_i$ the set of all selectors for player $i \in \{1,2\}$, and similarly, we denote by $\Lambda_i(s)$ the set of all
selectors for player $i$ at a state $s$. The selector $\xi$ is pure if for every state $s \in S$, there is a move
$a \in M$ such that $\xi(s)(a) = 1$. A strategy for player $i \in \{1, 2\}$ is a function $\pi: S^+ \to \mathcal{D}(M)$
that associates with every finite, nonempty sequence of states, representing the history of the play
so far, a selector for player $i$; that is, for all $w \in S^*$ and $s \in S$, we have $\mathrm{Supp}(\pi(w \cdot s)) \subseteq \Gamma_i(s)$.
The strategy $\pi$ is pure if it always chooses a pure selector; that is, for all $w \in S^+$, there is a move
$a \in M$ such that $\pi(w)(a) = 1$. A memoryless strategy is independent of the history of the play and
depends only on the current state. Memoryless strategies correspond to selectors; we write $\overline{\xi}$ for the
memoryless strategy consisting in playing forever the selector $\xi$. A strategy is pure memoryless if it
is both pure and memoryless. In a turn-based stochastic game, a strategy for player 1 is a function
$\pi_1: S^* \cdot S_1 \to \mathcal{D}(S)$, such that for all $w \in S^*$ and for all $s \in S_1$ we have $\mathrm{Supp}(\pi_1(w \cdot s)) \subseteq E(s)$.
Memoryless strategies and pure memoryless strategies are obtained as the restriction of strategies as
in the case of concurrent game graphs. The family of strategies for player 2 is defined analogously.
We denote by $\Pi_1$ and $\Pi_2$ the sets of all strategies for player 1 and player 2, respectively. We denote
by $\Pi_i^M$ and $\Pi_i^{PM}$ the sets of memoryless strategies and pure memoryless strategies for player $i$,
respectively.

Destinations of moves and selectors. For all states $s \in S$ and moves $a_1 \in \Gamma_1(s)$ and $a_2 \in \Gamma_2(s)$, we
indicate by $\mathrm{Dest}(s, a_1, a_2) = \mathrm{Supp}(\delta(s, a_1, a_2))$ the set of possible successors of $s$ when the moves
$a_1$ and $a_2$ are chosen. Given a state $s$, and selectors $\xi_1$ and $\xi_2$ for the two players, we denote by

$$\mathrm{Dest}(s, \xi_1, \xi_2) = \bigcup_{a_1 \in \mathrm{Supp}(\xi_1(s)),\; a_2 \in \mathrm{Supp}(\xi_2(s))} \mathrm{Dest}(s, a_1, a_2)$$

the set of possible successors of $s$ with respect to the selectors $\xi_1$ and $\xi_2$.
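
The two destination sets can be computed directly from their definitions. The sketch below assumes a hypothetical dictionary encoding in which `delta[(s, a1, a2)]` maps successors to probabilities and a selector at a state is a dictionary from moves to probabilities; the example transition table is made up for illustration.

```python
from fractions import Fraction


def dest_moves(delta, s, a1, a2):
    """Dest(s, a1, a2): support of the distribution delta(s, a1, a2)."""
    return {t for t, p in delta[(s, a1, a2)].items() if p > 0}


def dest_selectors(delta, s, xi1_s, xi2_s):
    """Dest(s, xi1, xi2): union of Dest(s, a1, a2) over the two supports."""
    support1 = [a for a, p in xi1_s.items() if p > 0]
    support2 = [a for a, p in xi2_s.items() if p > 0]
    successors = set()
    for a1 in support1:
        for a2 in support2:
            successors |= dest_moves(delta, s, a1, a2)
    return successors


# Illustrative transitions at a single state "s0" (an assumption of this sketch).
delta = {
    ("s0", "a", "c"): {"s0": Fraction(1)},
    ("s0", "a", "d"): {"s1": Fraction(1)},
    ("s0", "b", "c"): {"s2": Fraction(1)},
    ("s0", "b", "d"): {"s0": Fraction(1, 2), "s1": Fraction(1, 2)},
}
half = Fraction(1, 2)
pure_a = {"a": Fraction(1), "b": Fraction(0)}   # player 1 always plays "a"
uniform1 = {"a": half, "b": half}
uniform2 = {"c": half, "d": half}

reach_pure = dest_selectors(delta, "s0", pure_a, uniform2)
reach_uniform = dest_selectors(delta, "s0", uniform1, uniform2)
```

Only the supports of the selectors matter here, not the exact probabilities, which is why the pure selector excludes the successor reachable only through move "b".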

Once a starting state $s$ and strategies $\pi_1$ and $\pi_2$ for the two players are fixed, the game is reduced
to an ordinary stochastic process. Hence, the probabilities of events are uniquely defined, where
an event $\mathcal{A} \subseteq \Omega_s$ is a measurable set of plays. For an event $\mathcal{A} \subseteq \Omega_s$, we denote by $\mathrm{Pr}_s^{\pi_1,\pi_2}(\mathcal{A})$
the probability that a play belongs to $\mathcal{A}$ when the game starts from $s$ and the players follow the
strategies $\pi_1$ and $\pi_2$. Similarly, for a measurable function $f: \Omega_s \to \mathbb{R}$, we denote by $\mathrm{E}_s^{\pi_1,\pi_2}(f)$ the
expected value of $f$ when the game starts from $s$ and the players follow the strategies $\pi_1$ and $\pi_2$. For
$i \ge 0$, we denote by $\Theta_i: \Omega \to S$ the random variable denoting the $i$-th state along a play.

Valuations. A valuation is a mapping $v: S \to [0,1]$ associating a real number $v(s) \in [0,1]$ with
each state $s$. Given two valuations $v, w: S \to \mathbb{R}$, we write $v \le w$ when $v(s) \le w(s)$ for all states
$s \in S$. For an event $\mathcal{A}$, we denote by $\mathrm{Pr}^{\pi_1,\pi_2}(\mathcal{A})$ the valuation $S \to [0,1]$ defined for all states
$s \in S$ by $(\mathrm{Pr}^{\pi_1,\pi_2}(\mathcal{A}))(s) = \mathrm{Pr}_s^{\pi_1,\pi_2}(\mathcal{A})$. Similarly, for a measurable function $f: \Omega \to [0,1]$, we
denote by $\mathrm{E}^{\pi_1,\pi_2}(f)$ the valuation $S \to [0,1]$ defined for all $s \in S$ by $(\mathrm{E}^{\pi_1,\pi_2}(f))(s) = \mathrm{E}_s^{\pi_1,\pi_2}(f)$.

Reachability and safety objectives. Given a set $F \subseteq S$ of safe states, the objective of a safety game
consists in never leaving $F$. Therefore, we define the set of winning plays as the set
$\mathrm{Safe}(F) = \{\langle s_0, s_1, s_2, \ldots \rangle \in \Omega \mid s_k \in F \text{ for all } k \ge 0\}$. Given a subset $T \subseteq S$ of target states, the objective
of a reachability game consists in reaching $T$. Correspondingly, the set of winning plays is the set
$\mathrm{Reach}(T) = \{\langle s_0, s_1, s_2, \ldots \rangle \in \Omega \mid s_k \in T \text{ for some } k \ge 0\}$ of plays that visit $T$. For all $F \subseteq S$ and $T \subseteq S$, the
sets $\mathrm{Safe}(F)$ and $\mathrm{Reach}(T)$ are measurable. An objective in general is a measurable set, and in this
paper we consider only reachability and safety objectives. For an objective $\Phi$, the probability
of satisfying $\Phi$ from a state $s \in S$ under strategies $\pi_1$ and $\pi_2$ for players 1 and 2, respectively, is
$\mathrm{Pr}_s^{\pi_1,\pi_2}(\Phi)$. We define the value for player 1 of the game with objective $\Phi$ from the state $s \in S$ as

$$\langle\langle 1 \rangle\rangle_{\mathit{val}}(\Phi)(s) = \sup_{\pi_1 \in \Pi_1} \inf_{\pi_2 \in \Pi_2} \mathrm{Pr}_s^{\pi_1,\pi_2}(\Phi);$$

i.e., the value is the maximal probability with which player 1 can guarantee the satisfaction of $\Phi$
against all player-2 strategies. Given a player-1 strategy $\pi_1$, we use the notation

$$\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\pi_1}(\Phi)(s) = \inf_{\pi_2 \in \Pi_2} \mathrm{Pr}_s^{\pi_1,\pi_2}(\Phi).$$

A strategy $\pi_1$ for player 1 is optimal for an objective $\Phi$ if for all states $s \in S$, we have

$$\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\pi_1}(\Phi)(s) = \langle\langle 1 \rangle\rangle_{\mathit{val}}(\Phi)(s).$$

For $\varepsilon > 0$, a strategy $\pi_1$ for player 1 is $\varepsilon$-optimal if for all states $s \in S$, we have

$$\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\pi_1}(\Phi)(s) \ge \langle\langle 1 \rangle\rangle_{\mathit{val}}(\Phi)(s) - \varepsilon.$$

The notions of values and optimal strategies for player 2 are defined analogously. Reachability and
safety objectives are dual, i.e., we have $\mathrm{Reach}(T) = \Omega \setminus \mathrm{Safe}(S \setminus T)$. The quantitative determinacy
result of [17] ensures that for all states $s \in S$, we have

$$\langle\langle 1 \rangle\rangle_{\mathit{val}}(\mathrm{Safe}(F))(s) + \langle\langle 2 \rangle\rangle_{\mathit{val}}(\mathrm{Reach}(S \setminus F))(s) = 1.$$

Theorem 1 (Memoryless determinacy). For all concurrent game graphs $G$, for all $F, T \subseteq S$ such
that $F = S \setminus T$, the following assertions hold.

1. [13] Memoryless optimal strategies exist for safety objectives $\mathrm{Safe}(F)$.

2. [2, 12] For all $\varepsilon > 0$, memoryless $\varepsilon$-optimal strategies exist for reachability objectives
$\mathrm{Reach}(T)$.

3. [4] If $G$ is a turn-based stochastic game graph, then pure memoryless optimal strategies exist
for reachability objectives $\mathrm{Reach}(T)$ and safety objectives $\mathrm{Safe}(F)$.

3 Markov Decision Processes 

To develop our arguments, we need some facts about one-player versions of concurrent stochastic
games, known as Markov decision processes (MDPs) [11, 1]. For $i \in \{1, 2\}$, a player-$i$ MDP (for
short, $i$-MDP) is a concurrent game where, for all states $s \in S$, we have $|\Gamma_{3-i}(s)| = 1$. Given a
concurrent game $G$, if we fix a memoryless strategy corresponding to selector $\xi_1$ for player 1, the
game is equivalent to a 2-MDP $G_{\xi_1}$ with the transition function

$$\delta_{\xi_1}(s, a_2)(t) = \sum_{a_1 \in \Gamma_1(s)} \delta(s, a_1, a_2)(t) \cdot \xi_1(s)(a_1),$$

for all $s \in S$ and $a_2 \in \Gamma_2(s)$. Similarly, if we fix selectors $\xi_1$ and $\xi_2$ for both players in a concurrent
game $G$, we obtain a Markov chain, which we denote by $G_{\xi_1,\xi_2}$.

End components. In an MDP, the sets of states that play an equivalent role to the closed recurrent
classes of Markov chains [15] are called "end components" [6, 7].

Definition 3 (End components). An end component of an $i$-MDP $G$, for $i \in \{1, 2\}$, is a subset
$C \subseteq S$ of the states such that there is a selector $\xi$ for player $i$ so that $C$ is a closed recurrent class
of the Markov chain $G_\xi$.

It is not difficult to see that an equivalent characterization of an end component $C$ is the following.
For each state $s \in C$, there is a subset $M_i(s) \subseteq \Gamma_i(s)$ of moves such that:

1. (closed) if a move in $M_i(s)$ is chosen by player $i$ at state $s$, then all successor states that are
obtained with nonzero probability lie in $C$; and

2. (recurrent) the graph $(C, E)$, where $E$ consists of the transitions that occur with nonzero
probability when moves in $M_i(\cdot)$ are chosen by player $i$, is strongly connected.

Given a play $\omega \in \Omega$, we denote by $\mathrm{Inf}(\omega)$ the set of states that occur infinitely often along $\omega$.
Given a set $\mathcal{F} \subseteq 2^S$ of subsets of states, we denote by $\mathrm{Inf}(\mathcal{F})$ the event $\{\omega \mid \mathrm{Inf}(\omega) \in \mathcal{F}\}$. The
following theorem states that in a 2-MDP, for every strategy of player 2, the set of states that are
visited infinitely often is, with probability 1, an end component. Corollary 1 follows easily from
Theorem 2.

Theorem 2. [7] For a player-1 selector $\xi_1$, let $\mathcal{C}$ be the set of end components of a 2-MDP $G_{\xi_1}$. For
all player-2 strategies $\pi_2$ and all states $s \in S$, we have $\mathrm{Pr}_s^{\overline{\xi}_1,\pi_2}(\mathrm{Inf}(\mathcal{C})) = 1$.

Corollary 1. For a player-1 selector $\xi_1$, let $\mathcal{C}$ be the set of end components of a 2-MDP $G_{\xi_1}$, and let
$Z = \bigcup_{C \in \mathcal{C}} C$ be the set of states of all end components. For all player-2 strategies $\pi_2$ and all states
$s \in S$, we have $\mathrm{Pr}_s^{\overline{\xi}_1,\pi_2}(\mathrm{Reach}(Z)) = 1$.
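
The construction of the 2-MDP obtained by fixing a player-1 selector, given at the start of this section, amounts to averaging the transition function over player 1's moves. The sketch below assumes a hypothetical dictionary encoding (`delta[(s, a1, a2)]` maps successors to probabilities, `xi1[s]` maps moves to probabilities); the example game is made up for illustration.

```python
from fractions import Fraction


def fix_player1(delta, gamma1, gamma2, xi1):
    """Return d with d[(s, a2)][t] = sum over a1 of delta(s,a1,a2)(t) * xi1(s)(a1)."""
    mdp = {}
    for s in gamma1:
        for a2 in gamma2[s]:
            dist = {}
            for a1 in gamma1[s]:
                weight = xi1[s].get(a1, 0)
                if weight == 0:
                    continue  # moves outside the support contribute nothing
                for t, p in delta[(s, a1, a2)].items():
                    dist[t] = dist.get(t, 0) + weight * p
            mdp[(s, a2)] = dist
    return mdp


# Illustrative two-state game (an assumption of this sketch).
gamma1 = {"s0": ["a", "b"], "s1": ["a"]}
gamma2 = {"s0": ["c", "d"], "s1": ["c"]}
delta = {
    ("s0", "a", "c"): {"s0": Fraction(1)},
    ("s0", "a", "d"): {"s1": Fraction(1)},
    ("s0", "b", "c"): {"s1": Fraction(1)},
    ("s0", "b", "d"): {"s0": Fraction(1, 2), "s1": Fraction(1, 2)},
    ("s1", "a", "c"): {"s1": Fraction(1)},
}
xi1 = {"s0": {"a": Fraction(1, 2), "b": Fraction(1, 2)},
       "s1": {"a": Fraction(1)}}

mdp = fix_player1(delta, gamma1, gamma2, xi1)
```

Applying the same averaging a second time, over a player-2 selector, would yield the transition matrix of the Markov chain obtained when both selectors are fixed.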

MDPs with reachability objectives. Given a 2-MDP with a reachability objective $\mathrm{Reach}(T)$ for
player 2, where $T \subseteq S$, the values can be obtained as the solution of a linear program [13]. The
linear program has a variable $x(s)$ for all states $s \in S$, and the objective function and the constraints
are as follows:

$$\min \sum_{s \in S} x(s) \quad \text{subject to}$$

$$x(s) \ge \sum_{t \in S} x(t) \cdot \delta(s, a_2)(t) \quad \text{for all } s \in S \text{ and } a_2 \in \Gamma_2(s)$$

$$x(s) = 1 \quad \text{for all } s \in T$$

$$0 \le x(s) \le 1 \quad \text{for all } s \in S$$

The correctness of the above linear program to compute the values follows from [11, 13].

4 Strategy Improvement for Safety Games

In this section we present a strategy improvement algorithm for concurrent games with safety
objectives. The algorithm will produce a sequence of selectors $\gamma_0, \gamma_1, \gamma_2, \ldots$ for player 1, such that:

1. for all $i \ge 0$, we have $\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\gamma_i}(\mathrm{Safe}(F)) \le \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\gamma_{i+1}}(\mathrm{Safe}(F))$;

2. if there is $i \ge 0$ such that $\gamma_i = \gamma_{i+1}$, then $\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\gamma_i}(\mathrm{Safe}(F)) = \langle\langle 1 \rangle\rangle_{\mathit{val}}(\mathrm{Safe}(F))$; and

3. $\lim_{i \to \infty} \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\gamma_i}(\mathrm{Safe}(F)) = \langle\langle 1 \rangle\rangle_{\mathit{val}}(\mathrm{Safe}(F))$.

Condition 1 guarantees that the algorithm computes a sequence of monotonically improving
selectors. Condition 2 guarantees that if a selector cannot be improved, then it is optimal. Condition 3
guarantees that the value guaranteed by the selectors converges to the value of the game, or
equivalently, that for all $\varepsilon > 0$, there is a number $i$ of iterations such that the memoryless player-1 strategy
$\gamma_i$ is $\varepsilon$-optimal. Note that for concurrent safety games, there may be no $i \ge 0$ such that $\gamma_i = \gamma_{i+1}$,
that is, the algorithm may fail to generate an optimal selector. This is because there are concurrent
safety games such that the values are irrational [10]. We start with some notation.

The Pre operator and optimal selectors. Given a valuation $v$, and two selectors $\xi_1 \in \Lambda_1$ and
$\xi_2 \in \Lambda_2$, we define the valuations $\mathit{Pre}_{\xi_1,\xi_2}(v)$, $\mathit{Pre}_{1:\xi_1}(v)$, and $\mathit{Pre}_1(v)$ as follows, for all states
$s \in S$:

$$\mathit{Pre}_{\xi_1,\xi_2}(v)(s) = \sum_{a,b \in M} \sum_{t \in S} v(t) \cdot \delta(s,a,b)(t) \cdot \xi_1(s)(a) \cdot \xi_2(s)(b)$$

$$\mathit{Pre}_{1:\xi_1}(v)(s) = \inf_{\xi_2 \in \Lambda_2} \mathit{Pre}_{\xi_1,\xi_2}(v)(s)$$

$$\mathit{Pre}_1(v)(s) = \sup_{\xi_1 \in \Lambda_1} \inf_{\xi_2 \in \Lambda_2} \mathit{Pre}_{\xi_1,\xi_2}(v)(s)$$

Intuitively, $\mathit{Pre}_1(v)(s)$ is the greatest expectation of $v$ that player 1 can guarantee at a successor state
of $s$. Also note that given a valuation $v$, the computation of $\mathit{Pre}_1(v)$ reduces to the solution of a
zero-sum one-shot matrix game, and can be solved by linear programming. Similarly, $\mathit{Pre}_{1:\xi_1}(v)(s)$ is
the greatest expectation of $v$ that player 1 can guarantee at a successor state of $s$ by playing the
selector $\xi_1$. Note that all of these operators on valuations are monotonic: for two valuations $v, w$,
if $v \le w$, then for all selectors $\xi_1 \in \Lambda_1$ and $\xi_2 \in \Lambda_2$, we have $\mathit{Pre}_{\xi_1,\xi_2}(v) \le \mathit{Pre}_{\xi_1,\xi_2}(w)$,
$\mathit{Pre}_{1:\xi_1}(v) \le \mathit{Pre}_{1:\xi_1}(w)$, and $\mathit{Pre}_1(v) \le \mathit{Pre}_1(w)$. Given a valuation $v$ and a state $s$, we define
by

$$\mathrm{OptSel}(v, s) = \{\xi_1 \in \Lambda_1(s) \mid \mathit{Pre}_{1:\xi_1}(v)(s) = \mathit{Pre}_1(v)(s)\}$$

the set of optimal selectors for $v$ at state $s$. For an optimal selector $\xi_1 \in \mathrm{OptSel}(v, s)$, we define the
set of counter-optimal actions as follows:

$$\mathrm{CountOpt}(v, s, \xi_1) = \{b \in \Gamma_2(s) \mid \mathit{Pre}_{\xi_1,b}(v)(s) = \mathit{Pre}_1(v)(s)\}.$$

Observe that for $\xi_1 \in \mathrm{OptSel}(v, s)$, for all $b \in \Gamma_2(s) \setminus \mathrm{CountOpt}(v, s, \xi_1)$ we have
$\mathit{Pre}_{\xi_1,b}(v)(s) > \mathit{Pre}_1(v)(s)$. We define the set of optimal selector support and the counter-optimal action set as
follows:

$$\mathrm{OptSelCount}(v, s) = \{(A, B) \subseteq \Gamma_1(s) \times \Gamma_2(s) \mid \exists \xi_1 \in \Lambda_1(s).\; \xi_1 \in \mathrm{OptSel}(v, s) \wedge \mathrm{Supp}(\xi_1) = A \wedge \mathrm{CountOpt}(v, s, \xi_1) = B\};$$

i.e., it consists of pairs $(A, B)$ of actions of player 1 and player 2, such that there is an optimal
selector $\xi_1$ with support $A$, and $B$ is the set of counter-optimal actions to $\xi_1$.
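
For a fixed player-1 selector, the operators above are easy to evaluate, because the infimum over player-2 selectors is attained at a pure move. The sketch below computes $\mathit{Pre}_{\xi_1,b}(v)(s)$, $\mathit{Pre}_{1:\xi_1}(v)(s)$, and the counter-optimal set. It does not solve the matrix game for $\mathit{Pre}_1$; instead the hypothetical parameter `pre1_value` must be supplied by the caller. The dictionary encoding and the matching-pennies style state are assumptions of this sketch.

```python
from fractions import Fraction


def pre_move(delta, v, s, xi1_s, b):
    """Expected value of v after one step when player 1 plays xi1_s and player 2 plays b."""
    return sum(w * p * v[t]
               for a, w in xi1_s.items()
               for t, p in delta[(s, a, b)].items())


def pre_selector(delta, gamma2, v, s, xi1_s):
    """Value guaranteed by xi1_s at s: the minimum over pure player-2 moves."""
    return min(pre_move(delta, v, s, xi1_s, b) for b in gamma2[s])


def count_opt(delta, gamma2, v, s, xi1_s, pre1_value):
    """Player-2 moves that hold an optimal selector xi1_s down to pre1_value."""
    return {b for b in gamma2[s]
            if pre_move(delta, v, s, xi1_s, b) == pre1_value}


# Matching-pennies style state: player 1 wins when the moves "match".
gamma2 = {"s0": ["c", "d"]}
delta = {
    ("s0", "a", "c"): {"win": Fraction(1)},
    ("s0", "a", "d"): {"lose": Fraction(1)},
    ("s0", "b", "c"): {"lose": Fraction(1)},
    ("s0", "b", "d"): {"win": Fraction(1)},
}
v = {"win": Fraction(1), "lose": Fraction(0), "s0": Fraction(1, 2)}
uniform = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
pure_a = {"a": Fraction(1)}

guaranteed_uniform = pre_selector(delta, gamma2, v, "s0", uniform)
guaranteed_pure = pre_selector(delta, gamma2, v, "s0", pure_a)
# The one-shot game here has value 1/2, attained by the uniform selector.
counter = count_opt(delta, gamma2, v, "s0", uniform, Fraction(1, 2))
```

In this example the uniform selector is optimal and both player-2 moves are counter-optimal, so the pair ({a, b}, {c, d}) would belong to the set of optimal supports and counter-optimal action sets at this state.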

Turn-based reduction. Given a concurrent game $G = \langle S, M, \Gamma_1, \Gamma_2, \delta \rangle$ and a valuation $v$ we
construct a turn-based stochastic game $\overline{G}_v = ((\overline{S}, \overline{E}), (\overline{S}_1, \overline{S}_2, \overline{S}_R), \overline{\delta})$ as follows:

1. The set of states is as follows:

$$\overline{S} = S \cup \{(s, A, B) \mid s \in S,\ (A, B) \in \mathrm{OptSelCount}(v, s)\} \cup \{(s, A, b) \mid s \in S,\ (A, B) \in \mathrm{OptSelCount}(v, s),\ b \in B\}.$$

2. The state space partition is as follows: $\overline{S}_1 = S$; $\overline{S}_2 = \{(s, A, B) \mid s \in S,\ (A, B) \in \mathrm{OptSelCount}(v, s)\}$;
and $\overline{S}_R = \overline{S} \setminus (\overline{S}_1 \cup \overline{S}_2)$.

3. The set of edges is as follows:

$$\overline{E} = \{(s, (s, A, B)) \mid s \in S,\ (A, B) \in \mathrm{OptSelCount}(v, s)\} \cup \{((s, A, B), (s, A, b)) \mid b \in B\} \cup \{((s, A, b), t) \mid t \in \textstyle\bigcup_{a \in A} \mathrm{Dest}(s, a, b)\}.$$

4. The transition function $\overline{\delta}$ for all states in $\overline{S}_R$ is uniform over its successors.

Intuitively, the reduction is as follows. Given the valuation $v$, state $s$ is a player-1 state where player 1
can select a pair $(A, B)$ (and move to state $(s, A, B)$) with $A \subseteq \Gamma_1(s)$ and $B \subseteq \Gamma_2(s)$ such that
there is an optimal selector $\xi_1$ with support exactly $A$ and the set of counter-optimal actions to $\xi_1$ is
the set $B$. From a player-2 state $(s, A, B)$, player 2 can choose any action $b$ from the set $B$, and move
to state $(s, A, b)$. A state $(s, A, b)$ is a probabilistic state where all the states in $\bigcup_{a \in A} \mathrm{Dest}(s, a, b)$
are chosen uniformly at random. Given a set $F \subseteq S$ we denote by
$\overline{F} = F \cup \{(s, A, B) \in \overline{S} \mid s \in F\} \cup \{(s, A, b) \in \overline{S} \mid s \in F\}$.
We refer to the above reduction as TB, i.e., $(\overline{G}_v, \overline{F}) = \mathrm{TB}(G, v, F)$.
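
The four steps of the reduction can be sketched as follows. The sketch assumes that the sets of optimal supports and counter-optimal actions have already been computed and are passed in as a hypothetical dictionary `opt_sel_count`, mapping each state to a list of pairs of frozensets; computing those sets is not shown. Player-2 states are encoded as tuples `(s, A, B)` and random states as tuples `(s, A, b)`, which is an encoding choice of this sketch.

```python
from fractions import Fraction


def turn_based_reduction(S, delta, opt_sel_count):
    """Build the state partition, edges and uniform transitions of the reduced game."""
    s1 = set(S)                      # player-1 states: the original states
    s2, sr, edges = set(), set(), set()
    for s in S:
        for A, B in opt_sel_count[s]:
            p2_state = (s, A, B)     # player 2 picks a counter-optimal move
            s2.add(p2_state)
            edges.add((s, p2_state))
            for b in B:
                r_state = (s, A, b)  # random state over Dest(s, a, b), a in A
                sr.add(r_state)
                edges.add((p2_state, r_state))
                for a in A:
                    for t, p in delta[(s, a, b)].items():
                        if p > 0:
                            edges.add((r_state, t))
    successors = {}
    for u, t in edges:
        if u in sr:
            successors.setdefault(u, set()).add(t)
    delta_bar = {u: {t: Fraction(1, len(ts)) for t in ts}
                 for u, ts in successors.items()}
    return s1, s2, sr, edges, delta_bar


# Illustrative game: a matching-pennies state and two absorbing states.
S = ["s0", "win", "lose"]
delta = {
    ("s0", "a", "c"): {"win": Fraction(1)},
    ("s0", "a", "d"): {"lose": Fraction(1)},
    ("s0", "b", "c"): {"lose": Fraction(1)},
    ("s0", "b", "d"): {"win": Fraction(1)},
    ("win", "a", "c"): {"win": Fraction(1)},
    ("lose", "a", "c"): {"lose": Fraction(1)},
}
AB, CD = frozenset({"a", "b"}), frozenset({"c", "d"})
ONLY_A, ONLY_C = frozenset({"a"}), frozenset({"c"})
opt_sel_count = {"s0": [(AB, CD)],
                 "win": [(ONLY_A, ONLY_C)],
                 "lose": [(ONLY_A, ONLY_C)]}

s1, s2, sr, edges, delta_bar = turn_based_reduction(S, delta, opt_sel_count)
```

Note that the reduced game forgets the exact probabilities of the concurrent game: only the supports matter, which is consistent with the reduction being used for an almost-sure (qualitative) analysis.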

Value-class of a valuation. Given a valuation $v$ and a real $0 \le r \le 1$, the value-class $U_r(v)$ of value
$r$ is the set of states with valuation $r$, i.e., $U_r(v) = \{s \in S \mid v(s) = r\}$.

4.1 The strategy improvement algorithm 

Ordering of strategies. Let $G$ be a concurrent game and $F$ be the set of safe states. Let $T = S \setminus F$.
Given a concurrent game graph $G$ with a safety objective $\mathrm{Safe}(F)$, the set of almost-sure winning
states is the set of states $s$ such that the value at $s$ is 1, i.e.,
$W_1 = \{s \in S \mid \langle\langle 1 \rangle\rangle_{\mathit{val}}(\mathrm{Safe}(F))(s) = 1\}$.
An optimal strategy from $W_1$ is referred to as an almost-sure
winning strategy. The set $W_1$ and an almost-sure winning strategy can be computed in linear time by
the algorithm given in [8]. We assume without loss of generality that all states in $W_1 \cup T$ are
absorbing. We define a preorder $\prec$ on the strategies for player 1 as follows: given two player-1 strategies $\pi_1$
and $\pi_1'$, let $\pi_1 \prec \pi_1'$ if the following two conditions hold:
(i) $\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\pi_1}(\mathrm{Safe}(F)) \le \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\pi_1'}(\mathrm{Safe}(F))$;
and (ii) $\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\pi_1}(\mathrm{Safe}(F))(s) < \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\pi_1'}(\mathrm{Safe}(F))(s)$ for some state $s \in S$. Furthermore, we write
$\pi_1 \preceq \pi_1'$ if either $\pi_1 \prec \pi_1'$ or $\pi_1 = \pi_1'$. We first present an example showing that improvements
based only on $\mathit{Pre}_1$ operators are not sufficient for safety games, even on turn-based games, and then
present our algorithm.

Example 1. Consider the turn-based stochastic game shown in Fig. 1, where the □ states are player-1
states, the ◇ states are player-2 states, and the ○ states are random states with probabilities labeled on
edges. The safety goal is to avoid the state $s_4$. Consider a memoryless strategy $\pi_1$ for player 1 that
chooses the successor $s_0 \to s_2$, and the counter-strategy $\pi_2$ for player 2 that chooses $s_1 \to s_0$. Given
the strategies $\pi_1$ and $\pi_2$, the value at $s_0$, $s_1$ and $s_2$ is 1/3, and since all successors of $s_0$ have value
1/3, the value cannot be improved by $\mathit{Pre}_1$. However, note that if player 2 is restricted to choose
only value-optimal selectors for the value 1/3, then player 1 can switch to the strategy $s_0 \to s_1$ and
ensure that the game stays in the value class 1/3 with probability 1. Hence switching to $s_0 \to s_1$
would force player 2 to select a counter-strategy that switches to the strategy $s_1 \to s_3$, and thus
player 1 can get a value 2/3. ■

Fig. 1. A turn-based stochastic safety game.



Informal description of Algorithm 1. We now present the strategy improvement algorithm
(Algorithm 1) for computing the values for all states in $S \setminus W_1$. The algorithm iteratively improves player-1
strategies according to the preorder $\prec$. The algorithm starts with the random selector $\gamma_0 = \xi_1^{\mathit{unif}}$ that
plays at all states all actions uniformly at random. At iteration $i + 1$, the algorithm considers the
memoryless player-1 strategy $\overline{\gamma}_i$ and computes the value $\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_i}(\mathrm{Safe}(F))$. Observe that since $\overline{\gamma}_i$ is a
memoryless strategy, the computation of $\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_i}(\mathrm{Safe}(F))$ involves solving the 2-MDP $G_{\gamma_i}$. The
valuation $\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_i}(\mathrm{Safe}(F))$ is named $v_i$. For all states $s$ such that $\mathit{Pre}_1(v_i)(s) > v_i(s)$, the memoryless
strategy at $s$ is modified to a selector that is value-optimal for $v_i$. The algorithm then proceeds to the
next iteration. If $\mathit{Pre}_1(v_i) = v_i$, then the algorithm constructs the game $(\overline{G}_{v_i}, \overline{F}) = \mathrm{TB}(G, v_i, F)$,
and computes $\overline{A}_i$ as the set of almost-sure winning states in $\overline{G}_{v_i}$ for the objective $\mathrm{Safe}(\overline{F})$. Let
$U = (\overline{A}_i \cap S) \setminus W_1$. If $U$ is non-empty, then a selector $\gamma_{i+1}$ is obtained at $U$ from a pure
memoryless optimal strategy (i.e., an almost-sure winning strategy) in $\overline{G}_{v_i}$, and the algorithm proceeds to
iteration $i + 1$. If $\mathit{Pre}_1(v_i) = v_i$ and $U$ is empty, then the algorithm stops and returns the memoryless
strategy $\overline{\gamma}_i$ for player 1. Unlike strategy improvement algorithms for turn-based games (see [5] for a
survey), Algorithm 1 is not guaranteed to terminate, because the value of a safety game may not be
rational.
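
The valuation $v_i$ is obtained by solving a 2-MDP in which player 2 maximizes the probability of reaching $T$; the paper does this exactly with the linear program of Section 3. As a rough illustration only, the sketch below swaps in plain value iteration, which approximates the player-2 reachability value from below, so that one minus the result approaches the safety value from above. The function name, the dictionary encoding, and the three-state example are assumptions of this sketch.

```python
def reach_value_iteration(states, moves2, mdp, target, iterations=100):
    """Approximate player 2's maximal probability of reaching target in a 2-MDP."""
    x = {s: 1.0 if s in target else 0.0 for s in states}
    for _ in range(iterations):
        x = {
            s: 1.0 if s in target else max(
                sum(p * x[t] for t, p in mdp[(s, b)].items())
                for b in moves2[s]
            )
            for s in states
        }
    return x


# Illustrative 2-MDP (player 1's selector is already fixed): "t" is the
# unsafe target, "w" is a safe absorbing state.
states = ["s0", "t", "w"]
target = {"t"}
moves2 = {"s0": ["c", "d"], "t": ["c"], "w": ["c"]}
mdp = {
    ("s0", "c"): {"t": 1 / 3, "w": 2 / 3},
    ("s0", "d"): {"s0": 0.5, "t": 0.25, "w": 0.25},
    ("t", "c"): {"t": 1.0},
    ("w", "c"): {"w": 1.0},
}

reach = reach_value_iteration(states, moves2, mdp, target)
safe_value = {s: 1.0 - reach[s] for s in states}
```

In this example, always playing "d" at "s0" gives reachability probability 1/2, which beats the 1/3 obtained from "c", so the safety value of the fixed selector at "s0" is 1/2. An exact implementation would solve the linear program instead of iterating.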

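Lemma 1. Let $\gamma_i$ and $\gamma_{i+1}$ be the player-1 selectors obtained at iterations $i$ and $i + 1$ of Algorithm 1.
Let $I = \{s \in S \setminus (W_1 \cup T) \mid \mathit{Pre}_1(v_i)(s) > v_i(s)\}$. Let $v_i = \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_i}(\mathrm{Safe}(F))$ and
$v_{i+1} = \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_{i+1}}(\mathrm{Safe}(F))$. Then $v_{i+1}(s) \ge \mathit{Pre}_1(v_i)(s)$ for all states $s \in S$; and therefore $v_{i+1}(s) \ge v_i(s)$
for all states $s \in S$, and $v_{i+1}(s) > v_i(s)$ for all states $s \in I$.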

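Proof. Consider the valuations $v_i$ and $v_{i+1}$ obtained at iterations $i$ and $i + 1$, respectively, and let $w_i$
be the valuation defined by $w_i(s) = 1 - v_i(s)$ for all states $s \in S$. The counter-optimal strategy for
player 2 to minimize $v_{i+1}$ is obtained by maximizing the probability to reach $T$. Let

$$w_{i+1}(s) = \begin{cases} w_i(s) & \text{if } s \in S \setminus I; \\ 1 - \mathit{Pre}_1(v_i)(s) < w_i(s) & \text{if } s \in I. \end{cases}$$

In other words, $w_{i+1} = 1 - \mathit{Pre}_1(v_i)$, and we also have $w_{i+1} \le w_i$. We now show that $w_{i+1}$ is
a feasible solution to the linear program for MDPs with the objective $\mathrm{Reach}(T)$, as described in
Section 3. Since $v_i = \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_i}(\mathrm{Safe}(F))$, it follows that for all states $s \in S$ and all moves $a_2 \in \Gamma_2(s)$,
we have

$$w_i(s) \ge \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_i}(s, a_2)(t).$$

For all states $s \in S \setminus I$, we have $\gamma_i(s) = \gamma_{i+1}(s)$ and $w_{i+1}(s) = w_i(s)$, and since $w_{i+1} \le w_i$, it
follows that for all states $s \in S \setminus I$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$w_{i+1}(s) = w_i(s) \ge \sum_{t \in S} w_{i+1}(t) \cdot \delta_{\gamma_{i+1}}(s, a_2)(t) \quad (\text{for } s \in S \setminus I).$$

Since for $s \in I$ the selector $\gamma_{i+1}(s)$ is obtained as an optimal selector for $\mathit{Pre}_1(v_i)(s)$, it follows
that for all states $s \in I$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$\mathit{Pre}_{\gamma_{i+1},a_2}(v_i)(s) \ge \mathit{Pre}_1(v_i)(s);$$

in other words, $1 - \mathit{Pre}_1(v_i)(s) \ge 1 - \mathit{Pre}_{\gamma_{i+1},a_2}(v_i)(s)$. Hence for all states $s \in I$ and all moves
$a_2 \in \Gamma_2(s)$, we have

$$w_{i+1}(s) \ge \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_{i+1}}(s, a_2)(t).$$

Since $w_{i+1} \le w_i$, for all states $s \in I$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$w_{i+1}(s) \ge \sum_{t \in S} w_{i+1}(t) \cdot \delta_{\gamma_{i+1}}(s, a_2)(t) \quad (\text{for } s \in I).$$

Hence it follows that $w_{i+1}$ is a feasible solution to the linear program for MDPs with reachability
objectives. Since the reachability valuation for player 2 for $\mathrm{Reach}(T)$ is the least solution (observe
that the objective function of the linear program is a minimizing function), it follows that
$v_{i+1} \ge 1 - w_{i+1} = \mathit{Pre}_1(v_i)$. Thus we obtain $v_{i+1}(s) \ge v_i(s)$ for all states $s \in S$, and $v_{i+1}(s) > v_i(s)$ for
all states $s \in I$. ■

Algorithm 1 Safety Strategy-Improvement Algorithm

Input: a concurrent game structure $G$ with safe set $F$.
Output: a strategy $\overline{\gamma}$ for player 1.

0. Compute $W_1 = \{s \in S \mid \langle\langle 1 \rangle\rangle_{\mathit{val}}(\mathrm{Safe}(F))(s) = 1\}$.
1. Let $\gamma_0 = \xi_1^{\mathit{unif}}$ and $i = 0$.
2. Compute $v_0 = \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_0}(\mathrm{Safe}(F))$.
3. do {
   3.1. Let $I = \{s \in S \setminus (W_1 \cup T) \mid \mathit{Pre}_1(v_i)(s) > v_i(s)\}$.
   3.2. if $I \ne \emptyset$, then
      3.2.1. Let $\xi_1$ be a player-1 selector such that for all states $s \in I$,
             we have $\mathit{Pre}_{1:\xi_1}(v_i)(s) = \mathit{Pre}_1(v_i)(s) > v_i(s)$.
      3.2.2. The player-1 selector $\gamma_{i+1}$ is defined as follows: for each state $s \in S$, let
             $\gamma_{i+1}(s) = \gamma_i(s)$ if $s \notin I$;
             $\gamma_{i+1}(s) = \xi_1(s)$ if $s \in I$.
   3.3. else
      3.3.1. let $(\overline{G}_{v_i}, \overline{F}) = \mathrm{TB}(G, v_i, F)$
      3.3.2. let $\overline{A}_i$ be the set of almost-sure winning states in $\overline{G}_{v_i}$ for $\mathrm{Safe}(\overline{F})$ and
             $\overline{\pi}_1$ be a pure memoryless almost-sure winning strategy from the set $\overline{A}_i$.
      3.3.3. if $((\overline{A}_i \cap S) \setminus W_1 \ne \emptyset)$
         3.3.3.1. let $U = (\overline{A}_i \cap S) \setminus W_1$
         3.3.3.2. The player-1 selector $\gamma_{i+1}$ is defined as follows: for $s \in S$, let
                  $\gamma_{i+1}(s) = \gamma_i(s)$ if $s \notin U$;
                  $\gamma_{i+1}(s) = \xi_1(s)$ if $s \in U$, where $\xi_1(s) \in \mathrm{OptSel}(v_i, s)$,
                  $\overline{\pi}_1(s) = (s, A, B)$, and $B = \mathrm{CountOpt}(v_i, s, \xi_1)$.
   3.4. Compute $v_{i+1} = \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_{i+1}}(\mathrm{Safe}(F))$.
   3.5. Let $i = i + 1$.
   } until $I = \emptyset$ and $(\overline{A}_{i-1} \cap S) \setminus W_1 = \emptyset$.
4. return $\overline{\gamma}_i$.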

Recall that by Example 1, improvement by step 3.2 alone is not sufficient to guarantee
convergence to optimal values. We now present a lemma about the turn-based reduction, and
then show that step 3.3 also leads to an improvement. Finally, in Theorem 4 we show that if
improvements by step 3.2 and step 3.3 are not possible, then the optimal value and an optimal strategy
are obtained.

Lemma 2. Let $G$ be a concurrent game with a set $F$ of safe states. Let $v$ be a valuation and consider $(G_v, \overline{F}) = \mathrm{TB}(G, v, F)$. Let $\overline{A}$ be the set of almost-sure winning states in $G_v$ for the objective $\mathrm{Safe}(\overline{F})$, and let $\overline{\pi}_1$ be a pure memoryless almost-sure winning strategy from $\overline{A}$ in $G_v$. Consider a memoryless strategy $\pi_1$ in $G$ for states in $\overline{A} \cap S$ as follows: if $\overline{\pi}_1(s) = (s, A, B)$, then $\pi_1(s) \in \mathrm{OptSel}(v, s)$ such that $\mathrm{Supp}(\pi_1(s)) = A$ and $\mathrm{OptSelCount}(v, s, \pi_1(s)) = B$. Consider a pure memoryless strategy $\pi_2$ for player 2. If for all states $s \in \overline{A} \cap S$, we have $\pi_2(s) \in \mathrm{OptSelCount}(v, s, \pi_1(s))$, then for all $s \in \overline{A} \cap S$, we have $\Pr_s^{\pi_1, \pi_2}(\mathrm{Safe}(F)) = 1$.

Proof. We analyze the Markov chain arising after the players fix the memoryless strategies $\pi_1$ and $\pi_2$. Given the strategy $\pi_2$, consider the strategy $\overline{\pi}_2$ as follows: if $\overline{\pi}_1(s) = (s, A, B)$ and $\pi_2(s) = b \in \mathrm{OptSelCount}(v, s, \pi_1(s))$, then at state $(s, A, B)$ choose the successor $(s, A, b)$. Since $\overline{\pi}_1$ is an almost-sure winning strategy for $\mathrm{Safe}(\overline{F})$, it follows that in the Markov chain obtained by fixing $\overline{\pi}_1$ and $\overline{\pi}_2$ in $G_v$, all closed connected recurrent sets of states that intersect with $\overline{A}$ are contained in $\overline{A}$, and from all states of $\overline{A}$ the closed connected recurrent sets of states within $\overline{A}$ are reached with probability 1. It follows that in the Markov chain obtained from fixing $\pi_1$ and $\pi_2$ in $G$, all closed connected recurrent sets of states that intersect with $\overline{A} \cap S$ are contained in $\overline{A} \cap S$, and from all states of $\overline{A} \cap S$ the closed connected recurrent sets of states within $\overline{A} \cap S$ are reached with probability 1. The desired result follows. ■



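Lemma 3. Let $\gamma_i$ and $\gamma_{i+1}$ be the player-1 selectors obtained at iterations $i$ and $i+1$ of Algorithm 1. Let $I = \{ s \in S \setminus (W_1 \cup T) \mid \mathit{Pre}_1(v_i)(s) > v_i(s) \} = \emptyset$, and $(\overline{A}_i \cap S) \setminus W_1 \neq \emptyset$. Let $v_i = \langle\langle 1\rangle\rangle^{\gamma_i}(\mathrm{Safe}(F))$ and $v_{i+1} = \langle\langle 1\rangle\rangle^{\gamma_{i+1}}(\mathrm{Safe}(F))$. Then $v_{i+1}(s) \geq v_i(s)$ for all states $s \in S$, and $v_{i+1}(s) > v_i(s)$ for some state $s \in (\overline{A}_i \cap S) \setminus W_1$.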

Proof. We first show that $v_{i+1} \geq v_i$. Let $U = (\overline{A}_i \cap S) \setminus W_1$. Let $w_i(s) = 1 - v_i(s)$ for all states $s \in S$. Since $v_i = \langle\langle 1\rangle\rangle^{\gamma_i}(\mathrm{Safe}(F))$, it follows that for all states $s \in S$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$w_i(s) \geq \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_i}(s, a_2)(t).$$

The selector $\xi_1(s)$ chosen for $\gamma_{i+1}$ at $s \in U$ satisfies $\xi_1(s) \in \mathrm{OptSel}(v_i, s)$. It follows that for all states $s \in S$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$w_i(s) \geq \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_{i+1}}(s, a_2)(t).$$

It follows that the maximal probability with which player 2 can reach $T$ against the strategy $\gamma_{i+1}$ is at most $w_i$. It follows that $v_i(s) \leq v_{i+1}(s)$.

We now argue that for some state $s \in U$ we have $v_{i+1}(s) > v_i(s)$. Given the strategy $\gamma_{i+1}$, consider a pure memoryless counter-optimal strategy $\pi_2$ for player 2 to reach $T$. Since the selectors $\gamma_{i+1}(s)$ at states $s \in U$ are obtained from the almost-sure strategy $\overline{\pi}_1$ in the turn-based game $G_{v_i}$ to satisfy $\mathrm{Safe}(\overline{F})$, it follows from Lemma 2 that if for every state $s \in U$ the action $\pi_2(s) \in \mathrm{OptSelCount}(v_i, s, \gamma_{i+1})$, then from all states $s \in U$ the game stays safe in $F$ with probability 1. Since $\gamma_{i+1}$ is a given strategy for player 1, and $\pi_2$ is counter-optimal against $\gamma_{i+1}$, this would imply that $U \subseteq \{ s \in S \mid \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))(s) = 1 \}$. This would contradict that $W_1 = \{ s \in S \mid \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))(s) = 1 \}$ and $U \cap W_1 = \emptyset$. It follows that for some state $s^* \in U$ we have $\pi_2(s^*) \notin \mathrm{OptSelCount}(v_i, s^*, \gamma_{i+1})$, and since $\gamma_{i+1}(s^*) \in \mathrm{OptSel}(v_i, s^*)$ we have

$$v_i(s^*) < \sum_{t \in S} v_i(t) \cdot \delta_{\gamma_{i+1}}(s^*, \pi_2(s^*))(t);$$

in other words, we have

$$w_i(s^*) > \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_{i+1}}(s^*, \pi_2(s^*))(t).$$

Define a valuation $z$ as follows: $z(s) = w_i(s)$ for $s \neq s^*$, and $z(s^*) = \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_{i+1}}(s^*, \pi_2(s^*))(t)$. Hence $z < w_i$, and given the strategy $\gamma_{i+1}$ and the counter-optimal strategy $\pi_2$, the valuation $z$ satisfies the inequalities of the linear program for reachability to $T$. It follows that the probability to reach $T$ given $\gamma_{i+1}$ is at most $z$. Since $z < w_i$, it follows that $v_{i+1}(s) \geq v_i(s)$ for all $s \in S$, and $v_{i+1}(s^*) > v_i(s^*)$. This concludes the proof. ■

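We obtain the following theorem from Lemma 1 and Lemma 3; it shows that the sequence of values we obtain is monotonically non-decreasing.

Theorem 3 (Monotonicity of values). For $i \geq 0$, let $\gamma_i$ and $\gamma_{i+1}$ be the player-1 selectors obtained at iterations $i$ and $i+1$ of Algorithm 1. If $\gamma_i \neq \gamma_{i+1}$, then $\langle\langle 1\rangle\rangle^{\gamma_i}(\mathrm{Safe}(F)) < \langle\langle 1\rangle\rangle^{\gamma_{i+1}}(\mathrm{Safe}(F))$.

Theorem 4 (Optimality on termination). Let $v_i$ be the valuation at iteration $i$ of Algorithm 1 such that $v_i = \langle\langle 1\rangle\rangle^{\gamma_i}(\mathrm{Safe}(F))$. If $I = \{ s \in S \setminus (W_1 \cup T) \mid \mathit{Pre}_1(v_i)(s) > v_i(s) \} = \emptyset$, and $(\overline{A}_i \cap S) \setminus W_1 = \emptyset$, then $\gamma_i$ is an optimal strategy and $v_i = \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))$.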

Proof. We show that for all memoryless strategies $\pi_1$ for player 1 we have $\langle\langle 1\rangle\rangle^{\pi_1}(\mathrm{Safe}(F)) \leq v_i$. Since memoryless optimal strategies exist for concurrent games with safety objectives (Theorem 1), the desired result follows.

Let $\overline{\pi}_2$ be a pure memoryless optimal strategy for player 2 in $G_{v_i}$ for the objective complementary to $\mathrm{Safe}(\overline{F})$, where $(G_{v_i}, \mathrm{Safe}(\overline{F})) = \mathrm{TB}(G, v_i, F)$. Consider a memoryless strategy $\pi_1$ for player 1; we define a pure memoryless strategy $\pi_2$ for player 2 as follows.

1. If $\pi_1(s) \notin \mathrm{OptSel}(v_i, s)$, then $\pi_2(s) = b \in \Gamma_2(s)$ such that $\mathit{Pre}_{\pi_1(s), b}(v_i)(s) < v_i(s)$ (such a $b$ exists since $\pi_1(s) \notin \mathrm{OptSel}(v_i, s)$).

2. If $\pi_1(s) \in \mathrm{OptSel}(v_i, s)$, then let $A = \mathrm{Supp}(\pi_1(s))$, and consider $B$ such that $B = \mathrm{OptSelCount}(v_i, s, \pi_1(s))$. Then we have $\pi_2(s) = b$ such that $\overline{\pi}_2((s, A, B)) = (s, A, b)$.

Observe that by construction of $\pi_2$, for all $s \in S \setminus (W_1 \cup T)$, we have $\mathit{Pre}_{\pi_1(s), \pi_2(s)}(v_i)(s) \leq v_i(s)$. We first show that in the Markov chain obtained by fixing $\pi_1$ and $\pi_2$ in $G$, there is no closed connected recurrent set of states $C$ such that $C \subseteq S \setminus (W_1 \cup T)$. Assume towards contradiction that $C$ is a closed connected recurrent set of states in $S \setminus (W_1 \cup T)$. The following case analysis achieves the contradiction.

1. Suppose for every state $s \in C$ we have $\pi_1(s) \in \mathrm{OptSel}(v_i, s)$. Then consider the strategy $\overline{\pi}_1$ in $G_{v_i}$ such that for a state $s \in C$ we have $\overline{\pi}_1(s) = (s, A, B)$, where $\mathrm{Supp}(\pi_1(s)) = A$ and $B = \mathrm{OptSelCount}(v_i, s, \pi_1(s))$. Since $C$ is a closed connected recurrent set of states, it follows by construction that for all states $s \in C$ in the game $G_{v_i}$ we have $\Pr_s^{\overline{\pi}_1, \overline{\pi}_2}(\mathrm{Safe}(\overline{C})) = 1$, where $\overline{C} = C \cup \{ (s, A, B) \mid s \in C \} \cup \{ (s, A, b) \mid s \in C \}$. It follows that for all $s \in C$ in $G_{v_i}$ we have $\Pr_s^{\overline{\pi}_1, \overline{\pi}_2}(\mathrm{Safe}(\overline{F})) = 1$. Since $\overline{\pi}_2$ is an optimal strategy, it follows that $C \subseteq (\overline{A}_i \cap S) \setminus W_1$. This contradicts that $(\overline{A}_i \cap S) \setminus W_1 = \emptyset$.

2. Otherwise for some state $s^* \in C$ we have $\pi_1(s^*) \notin \mathrm{OptSel}(v_i, s^*)$. Let $r = \min \{ q \mid U_q(v_i) \cap C \neq \emptyset \}$, i.e., $r$ is the least value-class with non-empty intersection with $C$. Hence it follows that for all $q < r$, we have $U_q(v_i) \cap C = \emptyset$. Observe that since for all $s \in C$ we have $\mathit{Pre}_{\pi_1(s), \pi_2(s)}(v_i)(s) \leq v_i(s)$, it follows that for all $s \in U_r(v_i)$ either (a) $\mathit{Dest}(s, \pi_1(s), \pi_2(s)) \subseteq U_r(v_i)$; or (b) $\mathit{Dest}(s, \pi_1(s), \pi_2(s)) \cap U_q(v_i) \neq \emptyset$ for some $q < r$. Since $U_r(v_i)$ is the least value-class with non-empty intersection with $C$, it follows that for all $s \in U_r(v_i)$ we have $\mathit{Dest}(s, \pi_1(s), \pi_2(s)) \subseteq U_r(v_i)$. It follows that $C \subseteq U_r(v_i)$. Consider the state $s^* \in C$ such that $\pi_1(s^*) \notin \mathrm{OptSel}(v_i, s^*)$. By the construction of $\pi_2(s^*)$, we have $\mathit{Pre}_{\pi_1(s^*), \pi_2(s^*)}(v_i)(s^*) < v_i(s^*)$. Hence we must have $\mathit{Dest}(s^*, \pi_1(s^*), \pi_2(s^*)) \cap U_q(v_i) \neq \emptyset$ for some $q < r$. Thus we have a contradiction.

It follows from the above that there is no closed connected recurrent set of states in $S \setminus (W_1 \cup T)$, and hence with probability 1 the game reaches $W_1 \cup T$ from all states in $S \setminus (W_1 \cup T)$. Hence the probability to satisfy $\mathrm{Safe}(F)$ is equal to the probability to reach $W_1$. Since for all states $s \in S \setminus (W_1 \cup T)$ we have $\mathit{Pre}_{\pi_1(s), \pi_2(s)}(v_i)(s) \leq v_i(s)$, it follows that given the strategies $\pi_1$ and $\pi_2$, the valuation $v_i$ satisfies all the inequalities of the linear program to reach $W_1$. It follows that the probability to reach $W_1$ from $s$ is at most $v_i(s)$. It follows that for all $s \in S \setminus (W_1 \cup T)$ we have $\langle\langle 1\rangle\rangle^{\pi_1}(\mathrm{Safe}(F))(s) \leq v_i(s)$. The result follows. ■

Convergence. We first observe that since pure memoryless optimal strategies exist for turn-based stochastic games with safety objectives (Theorem 1), for turn-based stochastic games it suffices to iterate over pure memoryless selectors. Since the number of pure memoryless strategies is finite, it follows that for turn-based stochastic games Algorithm 1 always terminates and yields an optimal strategy. For concurrent games, we will use the result that for $\varepsilon > 0$, there is a $k$-uniform memoryless strategy that achieves the value of a safety objective within $\varepsilon$. We first define $k$-uniform memoryless strategies. For a positive integer $k > 0$, a selector $\xi$ for player 1 is $k$-uniform if for all $s \in S \setminus (T \cup W_1)$ and all $a \in \mathrm{Supp}(\xi(s))$ there exist $i, j \in \mathbb{N}$ such that $0 < i \leq j \leq k$ and $\xi(s)(a) = \frac{i}{j}$, i.e., the moves in the support are played with probabilities that are multiples of $\frac{1}{\ell}$ with $\ell \leq k$. A memoryless strategy is $k$-uniform if it is obtained from a $k$-uniform selector. We first present a technical lemma (Lemma 4) that will be used in the key lemma (Lemma 5) to prove the convergence result.
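The definition of a k-uniform selector can be checked mechanically for a single state. The following Python sketch assumes a hypothetical encoding in which the distribution chosen at a state is a dictionary from moves to exact `Fraction` probabilities; the function name `is_k_uniform` is illustrative and does not appear in the paper.

```python
from fractions import Fraction


def is_k_uniform(distribution, k):
    """Check the k-uniformity condition for the distribution at one state.

    distribution -- dict mapping moves to Fraction probabilities (assumed encoding)
    """
    support = [p for p in distribution.values() if p > 0]
    if sum(support, Fraction(0)) != 1:
        return False
    # p = i/j with 0 < i <= j <= k holds exactly when the denominator of p,
    # written in lowest terms, is at most k (Fraction keeps lowest terms).
    return all(p.denominator <= k for p in support)


third_two_thirds = {"a": Fraction(1, 3), "b": Fraction(2, 3)}
halves = {"a": Fraction(1, 2), "b": Fraction(1, 2), "c": Fraction(0)}
```

A selector is then k-uniform when this check succeeds at every state outside the target set and the sure-winning set.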

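Lemma 4. Let $a_1, a_2, \ldots, a_m$ be $m$ real numbers such that (1) for all $1 \leq i \leq m$, we have $a_i > 0$; and (2) $\sum_{i=1}^{m} a_i = 1$. Let $c = \min_{1 \leq i \leq m} a_i$. For $\eta > 0$, there exist $k > 0$ and $m$ real numbers $b_1, b_2, \ldots, b_m$ such that (1) for all $1 \leq i \leq m$, we have that $b_i$ is a multiple of $\frac{1}{k}$ and $b_i > 0$; (2) $\sum_{i=1}^{m} b_i = 1$; and (3) for all $1 \leq i \leq m$, we have $\frac{a_i}{b_i} \leq 1 + \eta$ and $\frac{b_i}{a_i} \leq 1 + \eta$.

Proof. Let $\ell \geq \frac{m}{c \cdot \eta}$ be an integer. For $1 \leq i \leq m$, define $\overline{b}_i$ such that $\overline{b}_i$ is a multiple of $\frac{1}{\ell}$ and $a_i \leq \overline{b}_i \leq a_i + \frac{1}{\ell}$ (basically define $\overline{b}_i$ as the least multiple of $\frac{1}{\ell}$ that is at least the value of $a_i$). For $1 \leq i \leq m$, let $b_i = \frac{\overline{b}_i}{\sum_{j=1}^{m} \overline{b}_j}$; i.e., $b_i$ is defined from $\overline{b}_i$ with normalization. Clearly, $\sum_{i=1}^{m} b_i = 1$, and for all $1 \leq i \leq m$, we have $b_i > 0$ and $b_i$ can be expressed as a multiple of $\frac{1}{k}$ for some integer $k > 0$. We have the following inequalities: for all $1 \leq i \leq m$, we have

$$b_i \leq a_i + \frac{1}{\ell}; \qquad b_i \geq \frac{a_i}{1 + \frac{m}{\ell}}.$$

The first inequality follows since $\overline{b}_i \leq a_i + \frac{1}{\ell}$ and $\sum_{i=1}^{m} \overline{b}_i \geq \sum_{i=1}^{m} a_i = 1$. The second inequality follows since $\overline{b}_i \geq a_i$ and $\sum_{i=1}^{m} \overline{b}_i \leq \sum_{i=1}^{m} (a_i + \frac{1}{\ell}) = 1 + \frac{m}{\ell}$. Hence for all $1 \leq i \leq m$, we have

$$\frac{b_i}{a_i} \leq 1 + \frac{1}{\ell \cdot a_i} \leq 1 + \frac{1}{\ell \cdot c} \leq 1 + \eta;$$

$$\frac{a_i}{b_i} \leq 1 + \frac{m}{\ell} \leq 1 + \eta \cdot c \leq 1 + \eta.$$

The desired result follows. ■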
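The rounding construction in this proof can be tried out with exact rational arithmetic. The sketch below is a minimal illustration under stated assumptions: the grid resolution is taken to be the least integer that is at least m/(c·η), which is one choice that makes the two ratio bounds hold and is not necessarily the constant used in the paper; the function name `round_to_grid` is hypothetical.

```python
from fractions import Fraction
from math import ceil


def round_to_grid(a, eta):
    """Round a probability vector up to a grid and renormalize.

    a   -- list of positive Fractions summing to 1
    eta -- positive Fraction, the allowed relative distortion
    """
    m = len(a)
    c = min(a)
    ell = ceil(m / (c * eta))  # assumed grid resolution: least integer >= m/(c*eta)
    # Least multiple of 1/ell that is at least a_i.
    b_bar = [Fraction(ceil(x * ell), ell) for x in a]
    total = sum(b_bar)
    # Normalize so that the result is again a probability vector.
    return [x / total for x in b_bar]


a = [Fraction(1, 3), Fraction(1, 5), Fraction(7, 15)]
eta = Fraction(3, 4)
b = round_to_grid(a, eta)
```

Each entry of `b` is a rational with a bounded denominator, so the resulting distribution is k-uniform for a suitable k, and both ratios a_i/b_i and b_i/a_i stay within 1 + η.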



Lemma 5. For all concurrent game graphs $G$, for all safety objectives $\mathrm{Safe}(F)$ with $F \subseteq S$, and for all $\varepsilon > 0$, there exist $k > 0$ and $k$-uniform selectors $\xi$ such that $\xi$ is an $\varepsilon$-optimal strategy.

Proof. Our proof uses a result of Solan [19] and the existence of memoryless optimal strategies for concurrent safety games (Theorem 1). We first present the result of Solan specialized for MDPs with reachability objectives.

The result of [19]. Let $G = (S, M, \Gamma_2, \delta)$ and $G' = (S, M, \Gamma_2, \delta')$ be two player-2 MDPs defined on the same state space $S$, with the same move set $M$ and the same move assignment function $\Gamma_2$, but with two different transition functions $\delta$ and $\delta'$, respectively. Let

$$\rho(G, G') = \max_{s, t \in S,\ a_2 \in \Gamma_2(s)} \left\{ \frac{\delta(s, a_2)(t)}{\delta'(s, a_2)(t)},\ \frac{\delta'(s, a_2)(t)}{\delta(s, a_2)(t)} \right\} - 1,$$

where by convention $x/0 = +\infty$ for $x > 0$, and $0/0 = 1$ (compare with equation (9) of [19]: $\rho(G, G')$ is obtained as a specialization of (9) of [19] for MDPs). Let $T \subseteq S$. For $s \in S$, let $v(s)$ and $v'(s)$ denote the value for player 2 for the reachability objective $\mathrm{Reach}(T)$ from $s$ in $G$ and $G'$, respectively. Then from Theorem 6 of [19] (also see equation (10) of [19]) it follows that

$$-4 \cdot |S| \cdot \rho(G, G') \leq v(s) - v'(s) \leq \frac{4 \cdot |S| \cdot \rho(G, G')}{(1 - 2 \cdot |S| \cdot \rho(G, G'))^{+}}, \qquad (1)$$

where $x^{+} = \max\{x, 0\}$. We first explain how specialization of Theorem 6 of [19] yields (1). Theorem 6 of [19] was proved for value functions of discounted games with costs, even when the discount factor $\lambda = 0$. Since the value functions of limit-average games are obtained as the limit of the value functions of discounted games as the discount factor goes to 0 [18], the result of Theorem 6 of [19] also holds for concurrent limit-average games (this was the main result of [19]). Since reachability objectives are a special case of limit-average objectives, Theorem 6 of [19] also holds for reachability objectives. In the special case of reachability objectives with the same target set, the different cost functions used in equation (10) of [19] coincide, and the maximum absolute value of the cost is 1. Thus we obtain (1) as a specialization of Theorem 6 of [19].

We now use the existence of memoryless optimal strategies in concurrent safety games, and (1), to obtain our desired result. Consider a concurrent safety game $G = (S, M, \Gamma_1, \Gamma_2, \delta)$ with safe set $F$ for player 1. Let $\pi_1$ be a memoryless optimal strategy for the objective $\mathrm{Safe}(F)$. Let $c = \min_{s \in S,\ a_1 \in \Gamma_1(s)} \{ \pi_1(s)(a_1) \mid \pi_1(s)(a_1) > 0 \}$ be the minimum positive transition probability given by $\pi_1$. Given $\varepsilon > 0$, let $\eta = \min\{ \frac{1}{4|S|}, \frac{\varepsilon}{8|S|} \}$. We define a memoryless strategy $\pi_1'$ satisfying the following conditions: for $s \in S$ and $a_1 \in \Gamma_1(s)$ we have

1. if $\pi_1(s)(a_1) = 0$, then $\pi_1'(s)(a_1) = 0$;

2. if $\pi_1(s)(a_1) > 0$, then the following conditions are satisfied:

(a) $\pi_1'(s)(a_1) > 0$;

(b) $\frac{\pi_1(s)(a_1)}{\pi_1'(s)(a_1)} \leq 1 + \eta$;

(c) $\frac{\pi_1'(s)(a_1)}{\pi_1(s)(a_1)} \leq 1 + \eta$; and

(d) $\pi_1'(s)(a_1)$ is a multiple of $\frac{1}{k}$ for an integer $k > 0$.

For a suitably large $k$ such a strategy $\pi_1'$ exists (this follows from the construction of Lemma 4). Let $G_1$ and $G_1'$ be the two player-2 MDPs obtained from $G$ by fixing the memoryless strategies $\pi_1$ and $\pi_1'$, respectively. Then by definition of $\pi_1'$ we have $\rho(G_1, G_1') \leq \eta$. Let $T = S \setminus F$. For $s \in S$, let the value of player 2 for the objective $\mathrm{Reach}(T)$ in $G_1$ and $G_1'$ be $v(s)$ and $v'(s)$, respectively. By (1) we have

$$-4 \cdot |S| \cdot \eta \leq v(s) - v'(s) \leq \frac{4 \cdot |S| \cdot \eta}{(1 - 2 \cdot |S| \cdot \eta)^{+}}.$$

Observe that by choice of $\eta$ we have (a) $4 \cdot |S| \cdot \eta \leq \frac{\varepsilon}{2}$ and (b) $2 \cdot |S| \cdot \eta \leq \frac{1}{2}$. Hence we have $-\varepsilon \leq v(s) - v'(s) \leq \varepsilon$. Since $\pi_1$ is a memoryless optimal strategy, it follows that $\pi_1'$ is a $k$-uniform memoryless $\varepsilon$-optimal strategy. ■

Strategy improvement with k-uniform selectors. We first argue that if we restrict Algorithm 1 such that every iteration yields a $k$-uniform selector, for $k > 0$, then the algorithm terminates. For $k > 0$, the restriction of Algorithm 1 to $k$-uniform selectors means that instead of considering all possible selectors for player 1, the algorithm restricts player 1 to select among the $k$-uniform selectors. The basic argument that the restricted algorithm terminates follows from the fact that the number of $k$-uniform selectors for a given $k$ is finite. A more formal argument is as follows: if we restrict player 1 to choose between $k$-uniform selectors, then a concurrent game graph $G$ can be converted to a turn-based stochastic game graph, where player 1 first chooses a $k$-uniform selector, then player 2 chooses an action, and then the transition is determined by the chosen $k$-uniform selector of player 1, the action of player 2, and the transition function $\delta$ of the game graph $G$. Then by termination for turn-based stochastic games it follows that the algorithm will terminate. Given $k > 0$, let us denote by $z_i^k$ the valuation of Algorithm 1 at iteration $i$, where the selectors for player 1 are restricted to be $k$-uniform. This gives us the following lemma.

Lemma 6. For all $k > 0$, there exists $i \geq 0$ such that $z_i^k = z_{i+1}^k$.
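The finiteness behind Lemma 6 can be made concrete by enumerating, for one state with m available moves, every distribution whose non-zero probabilities have the form i/j with j at most k. The Python sketch below is only an illustration of that counting argument; the function name `k_uniform_distributions` and the list-of-lists output format are assumptions, and the enumeration is exponential in m.

```python
from fractions import Fraction


def k_uniform_distributions(m, k):
    """Enumerate all length-m probability vectors with entries 0 or i/j, 0 < i <= j <= k."""
    grid = sorted({Fraction(i, j) for j in range(1, k + 1) for i in range(0, j + 1)})
    grid_set = set(grid)
    results = []

    def extend(prefix, remaining):
        # The last entry is forced: it must carry the remaining probability mass.
        if len(prefix) == m - 1:
            if remaining in grid_set:
                results.append(prefix + [remaining])
            return
        for p in grid:
            if p <= remaining:
                extend(prefix + [p], remaining - p)

    extend([], Fraction(1))
    return results


two_moves_k2 = k_uniform_distributions(2, 2)
two_moves_k3 = k_uniform_distributions(2, 3)
```

Because each state admits only finitely many such distributions, the number of k-uniform selectors over a finite state space is finite as well.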

Lemma 7. For all concurrent game graphs $G$, for all safety objectives $\mathrm{Safe}(F)$ with $F \subseteq S$, and for all $\varepsilon > 0$, there exist $k > 0$ and $i \geq 0$ such that for all $s \in S$ we have $z_i^k(s) \geq \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))(s) - \varepsilon$.

Proof. By Lemma 5, for all $\varepsilon > 0$, there exists $k > 0$ such that there is a $k$-uniform memoryless $\varepsilon$-optimal strategy for player 1. By Lemma 6, for all $k > 0$, there exists an $i \geq 0$ such that $z_i^k = z_{i+1}^k$, and it represents the maximal value obtained by $k$-uniform memoryless strategies. Hence it follows that there exist $k > 0$ and $i \geq 0$ such that for all $s \in S$ we have $z_i^k(s) \geq \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))(s) - \varepsilon$. The desired result follows. ■

Theorem 5 (Convergence). Let $v_i$ be the valuation obtained at iteration $i$ of Algorithm 1. Then the following assertions hold.

1. For all $\varepsilon > 0$, there exists $i$ such that for all $s$ we have $v_i(s) \geq \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))(s) - \varepsilon$.

2. $\lim_{i \to \infty} v_i = \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))$.

Proof. We prove both the results as follows.

1. Let $v_i$ be the valuation of Algorithm 1 at iteration $i$ (without any restriction). Since $v_i$ is obtained without the restriction of $k$-uniformity of selectors, it follows that for all $k > 0$ and for all $i \geq 0$, we have $z_i^k \leq v_i$. From Lemma 7 it follows that for all $\varepsilon > 0$, there exist $k > 0$ and $i \geq 0$ such that for all $s$ we have $z_i^k(s) \geq \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))(s) - \varepsilon$. Hence for all $\varepsilon > 0$, there exists $i \geq 0$ such that for all $s \in S$ we have $v_i(s) \geq \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))(s) - \varepsilon$ (this follows since $v_i \geq z_i^k$).

2. By Theorem 3, for all $i \geq 0$ we have $v_{i+1} \geq v_i$. By part (1), for all $\varepsilon > 0$, there exists $i \geq 0$ such that for all $s \in S$ we have $v_i(s) \geq \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))(s) - \varepsilon$. Hence it follows that for all $\varepsilon > 0$, there exists $i \geq 0$ such that for all $j \geq i$ and for all $s \in S$ we have $v_j(s) \geq \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))(s) - \varepsilon$. It follows that $\lim_{i \to \infty} v_i = \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))$.

This gives us the following result. ■

Complexity. Algorithm 1 may not terminate in general; we briefly describe the complexity of every iteration. Given a valuation $v_i$, the computation of $\mathit{Pre}_1(v_i)$ involves the solution of matrix games with rewards $v_i$; this can be done in polynomial time using linear programming. Given $v_i$, if $\mathit{Pre}_1(v_i) = v_i$, the sets $\mathrm{OptSel}(v_i, s)$ and $\mathrm{OptSelCount}(v_i, s)$ can be computed by enumerating the subsets of available actions at $s$ and then using linear programming. For example, to check whether $(A, B) \in \mathrm{OptSelCount}(v_i, s)$ it suffices to check both of these facts:

1. ($A$ is the support of an optimal selector $\xi_1$.) There is a selector $\xi_1$ such that (i) $\xi_1$ is optimal (i.e., for all actions $b \in \Gamma_2(s)$ we have $\mathit{Pre}_{\xi_1, b}(v_i)(s) \geq v_i(s)$); and (ii) for all $a \in A$ we have $\xi_1(a) > 0$, and for all $a \notin A$ we have $\xi_1(a) = 0$.

2. ($B$ is the set of counter-optimal actions against $\xi_1$.) For all $b \in B$ we have $\mathit{Pre}_{\xi_1, b}(v_i)(s) = v_i(s)$, and for all $b \notin B$ we have $\mathit{Pre}_{\xi_1, b}(v_i)(s) > v_i(s)$.

All the above checks can be performed by checking feasibility of sets of linear equalities and inequalities. Hence, $\mathrm{TB}(G, v_i, F)$ can be computed in time polynomial in the size of $G$ and $v_i$ and exponential in the number of moves. We observe that the construction is exponential only in the number of moves at a state, and not in the number of states. The number of moves at a state is typically much smaller than the size of the state space. We also observe that the improvement step 3.3.2 requires the computation of the set of almost-sure winning states of a turn-based stochastic safety game: this can be done both via linear-time discrete graph-theoretic algorithms [3], and via symbolic algorithms [9]. Both of these methods are more efficient than the basic step 3.4 of the improvement algorithm, where the quantitative values of an MDP must be computed. Thus, the improvement step 3.3 of Algorithm 1 is in practice not inefficient, compared with the standard improvement steps 3.2 and 3.4.
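The two checks listed above are easy to carry out once a candidate selector is given. The following Python sketch does not solve the linear-programming feasibility problem from the text; under the assumption that a distribution for player 1 at a fixed state is already known, it evaluates the one-step expectation against every player-2 move and extracts the counter-optimal set. The encoding of the transition function and the names `pre_value` and `counter_optimal_actions` are hypothetical.

```python
from fractions import Fraction


def pre_value(xi, b, delta_s, v):
    """One-step expectation of v when player 1 plays xi and player 2 plays b.

    delta_s[(a, b)] -- dict successor -> probability at the fixed state (assumed encoding)
    """
    return sum(
        xi[a] * sum(p * v[t] for t, p in delta_s[(a, b)].items())
        for a in xi
    )


def counter_optimal_actions(xi, moves2, delta_s, v, v_s):
    """Return None if xi is not optimal for value v_s, else the set B of
    player-2 moves whose one-step expectation equals v_s."""
    values = {b: pre_value(xi, b, delta_s, v) for b in moves2}
    if any(val < v_s for val in values.values()):
        return None
    return {b for b, val in values.items() if val == v_s}


# Hypothetical matching-pennies state: successors "good" (value 1), "bad" (value 0).
v = {"good": Fraction(1), "bad": Fraction(0)}
delta_s = {
    ("a1", "b1"): {"good": Fraction(1)},
    ("a1", "b2"): {"bad": Fraction(1)},
    ("a2", "b1"): {"bad": Fraction(1)},
    ("a2", "b2"): {"good": Fraction(1)},
}
uniform = {"a1": Fraction(1, 2), "a2": Fraction(1, 2)}
pure = {"a1": Fraction(1), "a2": Fraction(0)}
B_uniform = counter_optimal_actions(uniform, ["b1", "b2"], delta_s, v, Fraction(1, 2))
B_pure = counter_optimal_actions(pure, ["b1", "b2"], delta_s, v, Fraction(1, 2))
```

In this example the uniform distribution is optimal with both player-2 moves counter-optimal, while the pure distribution fails the optimality check.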

5 Termination for Approximation and Turn-based Games

In this section we present termination criteria for strategy improvement algorithms for concurrent games for $\varepsilon$-approximation, and then present an improved termination condition for turn-based games.

Termination for concurrent games. A strategy improvement algorithm for reachability games was presented in [2]. We refer to the algorithm of [2] as the reachability strategy improvement algorithm. The reachability strategy improvement algorithm is simpler than Algorithm 1: it is similar to Algorithm 1, but in every iteration only Step 3.2 is executed (and Step 3.3 need not be executed). Applying the reachability strategy improvement algorithm of [2] for player 2, for a reachability objective $\mathrm{Reach}(T)$, we obtain a sequence of valuations $(u_i)_{i \geq 0}$ such that (a) $u_{i+1} \geq u_i$; (b) if $u_{i+1} = u_i$, then $u_i = \langle\langle 2\rangle\rangle_{\mathrm{val}}(\mathrm{Reach}(T))$; and (c) $\lim_{i \to \infty} u_i = \langle\langle 2\rangle\rangle_{\mathrm{val}}(\mathrm{Reach}(T))$. Given a concurrent game $G$ with $F \subseteq S$ and $T = S \setminus F$, we apply the reachability strategy improvement algorithm to obtain the sequence of valuations $(u_i)_{i \geq 0}$ as above, and we apply Algorithm 1 to obtain a sequence of valuations $(v_i)_{i \geq 0}$. The termination criteria are as follows:

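1. if for some $i$ we have $u_{i+1} = u_i$, then we have $u_i = \langle\langle 2\rangle\rangle_{\mathrm{val}}(\mathrm{Reach}(T))$ and $1 - u_i = \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))$, and we obtain the values of the game;

2. if for some $i$ we have $v_{i+1} = v_i$, then we have $1 - v_i = \langle\langle 2\rangle\rangle_{\mathrm{val}}(\mathrm{Reach}(T))$ and $v_i = \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))$, and we obtain the values of the game; and

3. for $\varepsilon > 0$, if for some $i \geq 0$ we have $u_i + v_i \geq 1 - \varepsilon$, then for all $s \in S$ we have $v_i(s) \geq \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))(s) - \varepsilon$ and $u_i(s) \geq \langle\langle 2\rangle\rangle_{\mathrm{val}}(\mathrm{Reach}(T))(s) - \varepsilon$ (i.e., the algorithm can stop for $\varepsilon$-approximation).

Observe that since $(u_i)_{i \geq 0}$ and $(v_i)_{i \geq 0}$ are both monotonically non-decreasing and $\langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F)) + \langle\langle 2\rangle\rangle_{\mathrm{val}}(\mathrm{Reach}(T)) = 1$, it follows that if $u_i + v_i \geq 1 - \varepsilon$, then for all $j \geq i$ we have $u_i \geq u_j - \varepsilon$ and $v_i \geq v_j - \varepsilon$. This establishes that $v_i \geq \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F)) - \varepsilon$ and $u_i \geq \langle\langle 2\rangle\rangle_{\mathrm{val}}(\mathrm{Reach}(T)) - \varepsilon$, and the correctness of the stopping criterion (3) for $\varepsilon$-approximation follows. We also note that instead of applying the reachability strategy improvement algorithm, a value-iteration algorithm can be applied for reachability games to obtain a sequence of valuations with properties similar to $(u_i)_{i \geq 0}$, and the above termination criteria can be applied.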
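The three termination criteria translate directly into a small test that can be run after each pair of iterations. The Python sketch below assumes that the valuations are stored as dictionaries from states to numbers; the function name `stopping_status` and the returned labels are illustrative choices, not notation from the paper.

```python
def stopping_status(u_prev, u_cur, v_prev, v_cur, eps):
    """Classify the current iteration according to the three termination criteria.

    u_* -- lower bounds on the player-2 reachability values (assumed dict encoding)
    v_* -- lower bounds on the player-1 safety values
    """
    if u_cur == u_prev:
        return "exact-from-reachability"   # criterion 1
    if v_cur == v_prev:
        return "exact-from-safety"         # criterion 2
    if all(u_cur[s] + v_cur[s] >= 1 - eps for s in u_cur):
        return "eps-approximation"         # criterion 3
    return "continue"


u_prev, u_cur = {"s": 0.2}, {"s": 0.3}
v_prev, v_cur = {"s": 0.5}, {"s": 0.65}
status_loose = stopping_status(u_prev, u_cur, v_prev, v_cur, 0.1)
status_tight = stopping_status(u_prev, u_cur, v_prev, v_cur, 0.01)
```

With a tolerance of 0.1 the two lower bounds already certify an approximation in this example, whereas a tolerance of 0.01 requires further iterations.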

Theorem 6. Let $G$ be a concurrent game graph with a safety objective $\mathrm{Safe}(F)$. Algorithm 1 and the reachability strategy improvement algorithm for player 2 for the reachability objective $\mathrm{Reach}(S \setminus F)$ yield sequences of valuations $(v_i)_{i \geq 0}$ and $(u_i)_{i \geq 0}$, respectively, such that (a) for all $i \geq 0$, we have $v_i \leq \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F)) \leq 1 - u_i$; and (b) $\lim_{i \to \infty} v_i = \lim_{i \to \infty} (1 - u_i) = \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Safe}(F))$.

Termination for turn-based games. For turn-based stochastic games, both Algorithm 1 and the reachability strategy improvement algorithm terminate. Each iteration of the reachability strategy improvement algorithm of [2] is computable in polynomial time, and here we present a termination guarantee for the reachability strategy improvement algorithm. To apply the reachability strategy improvement algorithm we assume the objective of player 1 to be a reachability objective $\mathrm{Reach}(T)$, and the correctness of the algorithm relies on the notion of proper strategies. Let $W_2 = \{ s \in S \mid \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Reach}(T))(s) = 0 \}$. Then the notion of proper strategies and its properties are as follows.

Definition 4 (Proper strategies and selectors). A player-1 strategy $\pi_1$ is proper if for all player-2 strategies $\pi_2$, and for all states $s \in S \setminus (T \cup W_2)$, we have $\Pr_s^{\pi_1, \pi_2}(\mathrm{Reach}(T \cup W_2)) = 1$. A player-1 selector $\xi_1$ is proper if the memoryless player-1 strategy induced by $\xi_1$ is proper.

Lemma 8 ([2]). Given a selector $\xi_1$ for player 1, the memoryless player-1 strategy induced by $\xi_1$ is proper iff for every pure selector $\xi_2$ for player 2, and for all states $s \in S$, we have $\Pr_s^{\xi_1, \xi_2}(\mathrm{Reach}(T \cup W_2)) = 1$.

The following result follows from the result of [2] specialized for the case of turn-based stochastic games.

Lemma 9. Let $G$ be a turn-based stochastic game with reachability objective $\mathrm{Reach}(T)$ for player 1. Let $\gamma_0$ be the initial selector, and $\gamma_i$ be the selector obtained at iteration $i$ of the reachability strategy improvement algorithm. If $\gamma_0$ is a pure, proper selector, then the following assertions hold:

1. for all $i \geq 0$, we have that $\gamma_i$ is a pure, proper selector;

2. for all $i \geq 0$, we have $u_{i+1} \geq u_i$, where $u_i = \langle\langle 1\rangle\rangle^{\gamma_i}(\mathrm{Reach}(T))$ and $u_{i+1} = \langle\langle 1\rangle\rangle^{\gamma_{i+1}}(\mathrm{Reach}(T))$; and

3. if $u_{i+1} = u_i$, then $u_i = \langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Reach}(T))$, and there exists $i$ such that $u_{i+1} = u_i$.

The strategy improvement algorithm of Condon [5] works only for halting games, but the reachability strategy improvement algorithm works if we start with a pure, proper selector for reachability games that are not halting. Hence to use the reachability strategy improvement algorithm to compute values we need to start with a pure, proper selector. We present a procedure to compute a pure, proper selector, and then present termination bounds (i.e., bounds on $i$ such that $u_{i+1} = u_i$). The construction of a pure, proper selector is based on the notion of attractors defined below.

Attractor strategy. Let $A_0 = W_2 \cup T$, and for $i \geq 0$ we have

$$A_{i+1} = A_i \cup \{ s \in S_1 \cup S_R \mid E(s) \cap A_i \neq \emptyset \} \cup \{ s \in S_2 \mid E(s) \subseteq A_i \}.$$

Since for all $s \in S \setminus W_2$ we have $\langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Reach}(T))(s) > 0$, it follows that from all states in $S \setminus W_2$ player 1 can ensure that $T$ is reached with positive probability. It follows that for some $i \geq 0$ we have $A_i = S$. The pure attractor selector $\xi^*$ is as follows: for a state $s \in (A_{i+1} \setminus A_i) \cap S_1$ we have $\xi^*(s)(t) = 1$, where $t \in E(s) \cap A_i$ (such a $t$ exists by construction). The pure memoryless strategy $\xi^*$ ensures that for all $i \geq 0$, from $A_{i+1}$ the game reaches $A_i$ with positive probability. Hence there is no end-component $C$ contained in $S \setminus (W_2 \cup T)$ in the MDP $G_{\xi^*}$. It follows that $\xi^*$ is a pure selector that is proper, and the selector $\xi^*$ can be computed in $O(|E|)$ time. This completes the reachability strategy improvement algorithm for turn-based stochastic games. We now present the termination bounds.
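The attractor construction can be sketched as a simple fixed-point loop. The Python code below is a straightforward round-based version under assumed encodings (sets of player-1, player-2 and probabilistic states, and a successor list per state); it is not the linear-time implementation referred to in the text, and the name `attractor_selector` is hypothetical.

```python
def attractor_selector(states1, states2, states_r, edges, base):
    """Compute the attractor of `base` and a pure player-1 selector towards it.

    states1, states2, states_r -- sets of player-1, player-2, probabilistic states
    edges[s]                   -- list of successors of s (assumed encoding)
    base                       -- the initial set (the target set together with W_2)
    """
    attracted = set(base)
    selector = {}
    changed = True
    while changed:
        changed = False
        new = set()
        # Player-1 and probabilistic states need one successor in the current set.
        for s in (states1 | states_r) - attracted:
            hits = [t for t in edges[s] if t in attracted]
            if hits:
                new.add(s)
                if s in states1:
                    selector[s] = hits[0]
        # Player-2 states need all successors in the current set.
        for s in states2 - attracted:
            if all(t in attracted for t in edges[s]):
                new.add(s)
        if new:
            attracted |= new
            changed = True
    return attracted, selector


edges = {"p": ["q", "r"], "q": ["r", "t"], "r": ["p", "t"], "t": ["t"]}
attracted, selector = attractor_selector({"p"}, {"q"}, {"r"}, edges, {"t"})
```

In the example, the probabilistic state `r` is attracted first, after which the player-1 state `p` chooses `r` and the player-2 state `q` is attracted because all of its successors already are.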

Termination bounds. We present termination bounds for binary turn-based stochastic games. A turn-based stochastic game is binary if for all $s \in S_R$ we have $|E(s)| \leq 2$, and for all $s \in S_R$, if $|E(s)| = 2$, then for all $t \in E(s)$ we have $\delta(s)(t) = \frac{1}{2}$; i.e., for all probabilistic states there are at most two successors and the transition function $\delta$ is uniform.

Lemma 10. Let $G$ be a binary Markov chain with $|S|$ states with a reachability objective $\mathrm{Reach}(T)$. Then for all $s \in S$ we have $\langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Reach}(T))(s) = \frac{p}{q}$ with $p, q \in \mathbb{N}$ and $p, q \leq 4^{|S|-1}$.

Proof. The result follows as a special case of Lemma 2 of [5]. Lemma 2 of [5] holds for halting turn-based stochastic games, and since a Markov chain reaches the set of closed connected recurrent states with probability 1 from all states, the result follows. ■

Lemma 11. Let $G$ be a binary turn-based stochastic game with a reachability objective $\mathrm{Reach}(T)$. Then for all $s \in S$ we have $\langle\langle 1\rangle\rangle_{\mathrm{val}}(\mathrm{Reach}(T))(s) = \frac{p}{q}$ with $p, q \in \mathbb{N}$ and $p, q \leq 4^{|S_R|-1}$.

Proof. Since pure memoryless optimal strategies exist for both players (Theorem 1), we fix pure memoryless optimal strategies $\pi_1$ and $\pi_2$ for both players. The Markov chain $G_{\pi_1, \pi_2}$ can then be reduced to an equivalent Markov chain with $|S_R|$ states (since we fix deterministic successors for states in $S_1 \cup S_2$, they can be collapsed to their successors). The result then follows from Lemma 10. ■

From Lemma 11 it follows that at iteration $i$ of the reachability strategy improvement algorithm either the sum of the values increases by at least $\frac{1}{4^{|S_R|-1}}$, or else there is a valuation $u_i$ such that $u_{i+1} = u_i$. Since the sum of values of all states can be at most $|S|$, it follows that the algorithm terminates in at most $|S| \cdot 4^{|S_R|-1}$ steps. Moreover, since the number of pure memoryless strategies is at most $\prod_{s \in S_1} |E(s)|$, the algorithm terminates in at most $\prod_{s \in S_1} |E(s)|$ steps. It follows from the results of [20] that a turn-based stochastic game graph $G$ can be reduced to an equivalent binary turn-based stochastic game graph $G'$ such that the sets of player-1 and player-2 states in $G$ and $G'$ are the same and the number of probabilistic states in $G'$ is $O(|\delta|)$, where $|\delta|$ is the size of the transition function in $G$. Thus we obtain the following result.

Theorem 7. Let $G$ be a turn-based stochastic game with a reachability objective $\mathrm{Reach}(T)$. Then the reachability strategy improvement algorithm computes the values in time

$$O\Big( \min\Big\{ \prod_{s \in S_1} |E(s)|,\ 2^{O(|\delta|)} \Big\} \cdot \mathit{poly}(|G|) \Big),$$

where $\mathit{poly}$ is a polynomial function.

The results of [14] presented an algorithm for turn-based stochastic games that works in time $O(|S_R|! \cdot \mathit{poly}(|G|))$. The algorithm of [14] works only for turn-based stochastic games; for general turn-based stochastic games the complexity of the algorithm of [14] is better. However, for turn-based stochastic games where the transition function at all states can be expressed in constant bits we have $|\delta| = O(|S_R|)$. In these cases the reachability strategy improvement algorithm (that works for both concurrent and turn-based stochastic games) works in time $2^{O(|S_R|)} \cdot \mathit{poly}(|G|)$, as compared to the time $2^{O(|S_R| \cdot \log(|S_R|))} \cdot \mathit{poly}(|G|)$ of the algorithm of [14].